Oren Eini
Hibernating Rhinos
Contents

1 Introduction
   1.1 What is this?
   1.2
   1.3 In this book
      1.3.1 Part I
      1.3.2 Part II
      1.3.3 Part III
      1.3.4 Part IV
      1.3.5 Part V
   1.4 Distributed computing
      1.4.1 Fallacies

Part I

2 A little history
   2.1
      2.1.1
      2.1.2
      2.1.3
      2.1.4
   2.2
      2.2.1
      2.2.2
      2.2.3
      2.2.4
      2.2.5
   2.3
      2.3.1 Key/Value databases
      2.3.2 Graph databases
      2.3.3 Column databases
      2.3.4 Document databases
   2.4
      2.4.1

3
   3.1 Setting up everything
   3.2
   3.3
      3.3.1
      3.3.2 The session
      3.3.3 Database commands
      3.3.4
   3.4
      3.4.1
   3.5 Running in Production
   3.6 Summary

4 RavenDB concepts
   4.1
   4.2 Collections
   4.3 Metadata
   4.4 Document Identifiers
      4.4.1
      4.4.2
      4.4.3 High/low algorithm
      4.4.4 Identity
      4.4.5 Semantic ids
      4.4.6
   4.5 ETags
   4.6
      4.6.1
   4.7 Caching
      4.7.1 HTTP Caching
      4.7.2 Aggressive Caching
   4.8 Summary

5
   5.1
   5.2
   5.3 Streaming results
   5.4 Bulk inserts
   5.5
      5.5.1
      5.5.2 Scripted patching
      5.5.3 Concurrency
   5.6 Listeners
   5.7
   5.8
   5.9

6
   6.1
   6.2
   6.3
      6.3.2
      6.3.3
      6.3.4
      6.3.5
      6.3.6
   6.4
   6.5 Lucene
   6.6
   6.7
   6.8
   6.9

7
   7.1
      7.1.2
      7.1.3
   7.2
      7.2.2
      7.2.3
      7.2.4 Projections
      7.2.5
   7.3
   7.4 Summary

8 Static indexes
   8.1
   8.2
   8.3
      8.3.1
      8.3.2
   8.4
   8.5 Projections
   8.6
   8.7
      8.7.1
      8.7.2
   8.8 Queries
      8.8.1
      8.8.2
      8.8.3
      8.8.4
   8.9 Summary

9
   9.1
   9.2 Terms
   9.3 Analyzers
   9.4
   9.5 Facets
   9.6 Suggestions
   9.7

Part III

Part IV
Chapter 1
Introduction
RavenDB is a 2nd-generation document database, part of the NoSQL approach. That may or may not mean anything to you. If it doesn't, here is the elevator speech. A document database is a database that stores documents. Not Word or Excel documents, but documents in the sense of structured information in the form of self-contained data. Usually, a document is in JSON or XML format. RavenDB is a database for storing and working with JSON data.

RavenDB is a 2nd-generation database because we've been able to observe what everyone has been doing and learn from their mistakes. So RavenDB has a number of guidelines that resulted in a very different experience than is usually associated with NoSQL databases. For example, RavenDB is an ACID database, unlike many other NoSQL databases. Also, it was designed explicitly to be easy to use and maintain.

I'm a developer at heart. That means that one of my favorite activities is writing code. Writing documentation, on the other hand, is so far down the list of my favorite activities that one could say it isn't even on the list. I do like writing blog posts, and I've been maintaining an active blog for over a decade now. Documentation tends to be dry and, while informative, that hardly makes for good reading (or an interesting time writing it). RavenDB has quite a bit of documentation that tells you how to use it, what to do and why. This book isn't about providing documentation; we've got plenty of that. A blog post tells a story, even if most of mine are technical stories. I like writing those, and it appears that a large number of people also like reading them.
1.3.1 Part I

Chapter 2 introduces RavenDB, non-relational document stores, and the background story for RavenDB. Not only the technical details about what it does, but what led to its existence, and what was so important that we had to create a whole new database for it. If you are familiar with NoSQL databases and their history, you can skip this chapter and come back to it later.

Chapter 3 focuses on setting up a RavenDB server from scratch, then starting to work with the database. From there, we discuss how to use the RavenDB Client API and what sort of infrastructure your application can use when taking advantage of RavenDB. What RavenDB actually does with your data and storage engine choices are also covered. Finally, we talk briefly about running in production (which is covered in greater depth in Part IV).

Chapter 4 discusses RavenDB concepts, ranging from introductions of entities and documents to concepts of collections and metadata. We go over document identifiers in detail, including all the common strategies to generate a document ID and the implications of each. From there, we discuss etags and their use in RavenDB, including caching and optimistic concurrency control.

Chapter 5 explores advanced client-side operations. We begin by demonstrating how we can automate common tasks via listeners and then fine-tune the serialization process. Next is a review of result streaming and bulk inserts, as the preferred ways to port a lot of data out of and into RavenDB efficiently. From there, we talk about partial document updates (patching), lazy request optimizations, change notifications and the use of result transformers.
1.3.2 Part II
Indexing
1.3.4 Part IV
Operations
1.3.5 Part V
Scale out
Part I
In this part, we'll learn:
Chapter 2
A little history
This is a technical book, but I believe origin stories are as important as operation or design details. And so, I would like to tell you the actual story about how RavenDB began. I'm going to be talking about myself quite a lot in this section, but we'll resume talking about RavenDB immediately after, I promise.

You can skip this chapter and go directly to the technical content if you wish, but I do suggest reading it at some point.
Taken together, this means that ever since 2004, my job has largely been to go to clients and improve the performance, stability, and scalability of their applications.

The problem with that was that at some point, I made a mistake. I was doing a code review for two clients, and I sent each of them the report about the issues found in the other client's code and how to resolve them. Mistakes like that aren't pleasant, but they happen. So, imagine my surprise when neither client noticed that they had a report about a completely different codebase before I pointed it out. In fact, one client was actually fast enough to implement my suggestions before I could get back to them. Although they did comment that I was probably working on top of a different branch :-).
I blogged about those issues extensively, and most of the people who invited me to look over their code were readers of my blog. By definition, then, they weren't stupid, careless or malicious. In fact, most of them were dedicated, hard working and very good at their jobs.
another OSS project. I called it Rhino.DivanDB; I think you can guess what inspired me.

The problem is that pronouncing Rhino.DivanDB is pretty hard (try saying it out loud several times). Eventually, I realized that I was actually calling it RivanDB. From there, it was just a matter of finding a word that was close and made sense. In the end, this is how RavenDB was born.
production using RavenDB within four months! And the pickup since then has been well above what I initially expected.

Oh well, I'll settle for building great databases, rather than realistic business plans :-).

Today, the RavenDB Core Team has about 15 full-time developers, and the limiting factor is the quality we require from our people. RavenDB runs mission-critical systems, from healthcare to banking, from government institutions to real estate brokerage systems. Several books have been written about RavenDB, articles about it show up regularly in trade journals and you can hear talks at user groups and conferences across the world.

I'm quite happy with this situation, as you can imagine. And RavenDB is just getting better.

All of that said, the back story might be interesting to you or not, but you aren't here to read about the history of RavenDB. You are reading this because you want to use RavenDB in your future. And that means that you need to understand why you'll want to do that.
Skipping ahead again, by 1996 you could actually purchase a 2.83 GB drive for merely $2,900. A car at that time would cost you $12,371. I could go on, but I'm sure that you get the point by now. Storage used to be expensive. So expensive that it dominated pretty much every other concern that you can think of.

At the time of this writing, you can get a 6 TB drive for less than $300, and a 3 TB drive will cost you roughly $100. That puts 2014 at roughly 30 cents per gigabyte, versus roughly 40,000 dollars per gigabyte in 1980.
Even leaving this aside, we also have to consider the type of applications that were written at that time. In the 80s and early 90s, the absolute height of user interface was the master/details form. And the number of users you had could usually be counted on a single finger.

That environment produced databases that were optimized to answer the kind of challenges that prevailed at the time. Storage was expensive, so a major effort was made to reduce storage as much as possible. Users' time was very cheap by comparison, so trade-offs that meant we could save some disk space at the expense of making the user wait were good design decisions. Time passed, machines got faster and cheaper, disk space became cheap and making the user wait became unacceptable, but we are still seeing those trade-offs today.

Those are mostly evident when you look at normalization, fixed schemas and the relational model itself.
customer information are all pointing at the same address. Updating the address for the customer will therefore also update the address for all of its orders. When we look at one of those orders, we won't see the address that it was shipped to, but the current customer address.

In the real world, I've seen such things happen with payroll systems and paystubs (payslips across the pond). An employee got married, and changed her bank account information to the new shared bank account. The couple also wanted to purchase a home, so they applied for a mortgage. As part of that, they had to submit paystubs from the past several months. That employee requested that the HR department send her the last few stubs. When the bank saw that there were paystubs made out to an account that didn't even exist, they suspected fraud, the mortgage was denied and the police were called. An unpleasant situation all around.

The common response when showing this issue is that it is a matter of bad modeling decisions (and I agree). The problem is that the appropriate model would mean that each order has its own address id in the addresses table. That isn't a really good idea; you'll have to do additional joins to get the data. Combine that with a real-world model of even moderate complexity, and the size and cost of the model just explodes.
The usual scaling method for relational databases was to buy a bigger box. Rinse and repeat until you run out of bigger boxes, and at that point, you're pretty much stuck.

Since my day job is building databases, let us assume that I got the requirement to build a relational database that would allow distribution of data among multiple nodes. The first thing to do would be to create a table with a primary key. We can just decide that certain key ranges would go to certain nodes, and we can move on. That does raise the issue of what to do when a node is down. I will not be able to read or write any rows which fall in that node's range.5

We'll ignore this problem for now and try to implement the next feature, a unique constraint. That is required so I can't have multiple users with the same email. But this is just making things that much harder again. Now every insert or update to a user's email will require us to talk to the other nodes. And what happens if one of the nodes is unavailable? At this point, I don't know if I have this email or not. I might, if it is located on the node that I cannot access.

We'll ignore this problem as well, and just assume that we have competent and awesome DBAs and that nodes can never go down. What is the cost of making a query? Well, simple queries I can probably route to a single node. But complex queries?
Consider the following query, using the schema we have seen in Figure 3.

SELECT *
FROM Orders o
  JOIN OrderLines ol ON ol.OrderId = o.Id
  JOIN Products p ON ol.ProductId = p.Id
  JOIN Addresses a ON o.AddressId = a.Id
  JOIN Customers c ON o.CustomerId = c.Id
WHERE o.Id = 7331
In a system with multiple nodes, how costly do you think this query is going to be? This is a pretty simple query, but it is still likely to require us to touch multiple nodes. Even ignoring the option of failure, this is a ridiculously expensive way to do business. And the more data you have, and the more nodes you have, the more expensive this becomes.

As the need for actual web-scale grew, people noticed that it is not really possible to scale out a relational database. And there is only so much scaling up that you can do. This is where the need for alternative solutions came into play. Thus, the need for NoSQL.

5 Yes, we can avoid this by duplicating the data on multiple nodes, but that just moves the problem around; instead of one node being down, we need two or three nodes down to have the same effect.
Key/Value Databases
Graph Databases
Column Family Databases
Document Databases

A simplified view, I'm aware, but good enough for our purposes right now.
A major flaw in scaling such systems is that most graphs tend to be highly interconnected, and it is very hard to isolate an independent subgraph and break it out to a separate machine. Consider the classic six degrees of separation theory, and that the average distance between any two random Twitter users is less than four.

Because of that, the use of graph databases is usually limited to just the associations that need to be handled. Most of the actual data is stored in another type of database.
Note that most of those design goals are actually just different ways to say the same thing. Basically, the goal of RavenDB is that it Gets Out of Your Way and Just Works.
running on Windows. And most NoSQL solutions either flat out couldn't run, or could run, but only as alpha-quality software. My goal was to create a really good database that .NET developers could use. As it turned out, we managed to do quite a bit more than that.

As a small example of that, here are the installation instructions for RavenDB:

You now have a running RavenDB database, and you can browse to (by default) http://localhost:8080/ to access the RavenDB Studio. The point of these instructions isn't so much to explain how to install RavenDB; we'll touch on that later. The point is that this is literally the smallest number of steps that we could get to for getting RavenDB up and running.
This is probably a good time to note that this is actually how we deploy to our own production environment. As part of our dogfooding effort, we always run our systems using the default configuration, to make sure that RavenDB can optimize itself for our needs automatically.

Most people get excited when they see that RavenDB ships with a fully functional management studio. There is no additional tool to install; just browse to the database URL and you can start working with your data. To be honest, even though we invested a lot of time and effort in the studio, that is quite insulting. We've spent even more time making sure that the actual database engine is pretty awesome, and people get hung up on the UI.

A picture is worth a thousand words, and I think that Figure 2 can probably help you understand what we didn't want to have.
When I initially started looking at the NoSQL landscape, there were a lot of really great ideas, and some projects that really picked up traction. But all of them were focused on solving the technical details: let us create a database that can do this and that. This resulted in expert tools. The kind that you could do really great things with, but only if you were an expert. If you weren't an expert, however, those kinds of tools would be worse than useless. You might end up removing more than just your foot, trying to use them.

The inspiration behind RavenDB was simple: it doesn't have to be so hard. RavenDB was conceived and designed primarily so we could take the essence behind the NoSQL movement and translate it into the kind of tooling that you can use even without spending three months of your life learning the ins and outs of your tooling. This was done by ruthlessly finding all points of friction and eliminating them with extreme prejudice.
Chapter 3
public static class DocumentStoreHolder
{
    private static readonly Lazy<IDocumentStore> _store =
        new Lazy<IDocumentStore>(CreateStore);

    private static IDocumentStore CreateStore()
    {
        var documentStore = new DocumentStore
        {
            Url = "http://localhost:8080",
            DefaultDatabase = "Northwind",
        };
        documentStore.Initialize();
        return documentStore;
    }

    public static IDocumentStore Store
    {
        get { return _store.Value; }
    }
}
The use of Lazy ensures that the document store is only created once, without having to worry about double locking or explicit thread safety issues. And we can configure the document store as we see fit. The rest of the code has access to the document store using DocumentStoreHolder.Store. That should be relatively rare, since apart from configuring the document store, the majority of the work is done using the session. But before we get to that, let us see what sort of configuration we can do with the document store.
3.3.1.1 Conventions

The RavenDB Client API, just like the rest of RavenDB, aims to Just Work. As a result of that, it is based around the notion of conventions: a series of policy decisions that have already been made for you. Those range from deciding which property holds the document id to how the entity should be serialized to a document.

For the most part, we expect that you'll not have to touch the conventions. A lot of thought and effort has gone into ensuring that you'll have no need to do that. But there is simply no way that we can foresee the future, or answer every need, which is why pretty much every part of the client API is customizable.

Most of that is handled via the DocumentStore.Conventions property, by registering your own behavior. For example, by default the RavenDB Client API will use a property named Id (case sensitive) to store the document id. But there are users who want to use the entity name as part of the property name. So we'll have OrderId for orders, ProductId for products, etc.

Here is how we can tell the RavenDB Client API that it should use this behavior:
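A minimal sketch of such a registration, using the client's FindIdentityProperty convention; the exact matching rule in the lambda is my own illustration:

// Sketch: treat a property named "<TypeName>Id" (e.g. OrderId on Order)
// as the identity property. The matching rule shown is illustrative.
DocumentStoreHolder.Store.Conventions.FindIdentityProperty =
    prop => prop.Name == prop.DeclaringType.Name + "Id";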
<connectionStrings>
  <add name="RavenDB" connectionString="Url=http://localhost:8080;User=beam;Password=up;Database=Scotty" />
  <add name="Embedded" connectionString="DataDir=~\Northwind" />
</connectionStrings>

In this manner, you can modify which server and database your client application will talk to just by modifying the configuration. You might also have noticed that we have an embedded connection string as well; what is that?
3.3.1.3 Document store types

RavenDB can run in several modes. The most obvious one is when you run it as a console application and communicate with it over the network. In production, you do pretty much the same thing, except that you'll run RavenDB in IIS or as a Windows Service. This is great for building server applications, where you want independent access to your database, but there are other options with RavenDB.

You can run RavenDB as part of your application, embedded inside your own process. If you want to do that, just use the EmbeddableDocumentStore class instead of DocumentStore. You can even configure the EmbeddableDocumentStore to talk to a remote server or an embedded database just by changing the connection string. The main advantage of using an embedded RavenDB instance is that you don't need separate deployment or administration. There is also no need to traverse the network to access the data, since it lives inside the same process as your own application.
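For instance, a minimal embedded setup might look like this (the directory name is an illustrative assumption):

// Sketch: an in-process RavenDB instance; no separate server to deploy.
// The data directory value here is illustrative.
var store = new EmbeddableDocumentStore
{
    DataDirectory = "~/Northwind"
};
store.Initialize();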
This option is particularly attractive for teams building low-overhead systems or business applications that are deployed client side. Octopus Deploy is an automated deployment system that makes use of RavenDB in just such a manner. Even if you use it, you're probably not aware that it is using RavenDB behind the scenes, since that is all internal to the application.

On the other side, you have NServiceBus, which also makes heavy use of RavenDB, but usually does so in server mode. So you'll install RavenDB as part of your NServiceBus deployment and manage it as an independent service.

From a coding perspective, there is very little difference between the two. In fact, even in embedded mode, you are going through the exact same code paths you'll be going through when talking to a remote database, except that there is no networking involved.
3.3.1.4 Authentication

A database holds a lot of information, and usually it is pretty important that you'll have control over who can access that information and what they can do with it. Note that this is a server-level option, rather than a database-level option. You can think about this as Windows Authentication and SQL Authentication in SQL Server.
// creating a new product
string productId;
using (var session = documentStore.OpenSession())
{
    var product = new Product
    {
        Category = "Awesome",
        Name = "RavenDB",
        Supplier = "Hibernating Rhinos",
    };
    session.Store(product);
    productId = product.Id;
    session.SaveChanges();
}

// loading & modifying the product
using (var session = documentStore.OpenSession())
{
    var p = session.Load<Product>(productId);
    p.ReorderLevel++;
    session.SaveChanges();
}
There are several interesting things in Listing 3.5. Look at the Store() call: immediately after that call, we can already access the document id, even though we didn't save the change to the database yet. Next, in the second session, we load the entity from the document, update the entity and call SaveChanges(). The session is smart enough to understand that the entity has changed and update the matching document on the server side. You don't have to call an Update() method, or anything of that sort. The session keeps track of all the entities you have loaded, and when you call SaveChanges(), all the changes to those entities are sent to the database in a single remote call.
Load()
Include()
Delete()
Query()
Store()
SaveChanges()
Advanced

Those are the most common operations that you'll run into on a day-to-day basis. And more options are available in the Advanced property.
3.3.2.1 Load

As the name implies, this gives you the option of loading a document or a set of documents into the session. A document loaded into the session is managed by the session; any changes made to the document will be persisted to the database when you call SaveChanges. A document can only be loaded once in a session.

5 See Release It!, a wonderful book that heavily influenced the RavenDB design.

Let's look at the following code:
var p1 = session.Load<Product>("products/1");
var p2 = session.Load<Product>("products/1");
Assert.True(Object.ReferenceEquals(p1, p2));
Even though we call Load<Product>("products/1") twice, there is only a single remote call to the server, and only a single instance of the Product class. Whenever a document is loaded, it is added to an internal dictionary that the session manages. Whenever you load a document, the session checks that dictionary to see if the document is already there, and if so, it will return the existing instance immediately. This helps avoid aliasing issues and also generally helps performance.

For those of you who deal with patterns, the session implements the Unit of Work and Identity Map patterns. This is most obvious when talking about the Load operation, but it also applies to Query and Delete.
Load can also be used to read more than a single document at a time. For example, if I wanted three documents, I could use:

Product[] products = session.Load<Product>(
    "products/1",
    "products/2",
    "products/3"
);

This will result in an array with all three documents in it, retrieved in a single remote call from the server. The positions in the array match the positions of the ids given to the Load call.
You can even load documents belonging to multiple types in a single call, like so:

object[] items = session.Load<object>(
    "products/1",
    "categories/2"
);
Product p = (Product)items[0];
Category c = (Category)items[1];

A missing document will result in null being returned, both when loading a single document and multiple documents (a null will be returned in the id's position). The session will remember that it couldn't load that document, and even if asked again, it will immediately return null, without going back to the server.
products/1: Chai
Category: Beverages, Soft drinks, coffees, teas, beers, and ales
This results in the right output, but we have to go to the server twice. That seems unnecessary. We cannot use the Load overload that accepts multiple ids, because we don't know ahead of time what the value of the Category will be. What we can do is ask RavenDB to help us. We'll change the first line of code to be:

var product = session.Include<Product>(x => x.Category)
    .Load("products/1");
The rest of the code will remain unchanged. This single change has a profound effect on the way the system behaves, because it tells the RavenDB server to do the following:

Find the document with the key products/1.
Read its Category property value.
Find the document with that key as well.
Send both documents back to the client.

RavenDB can do that because the reply to a Load request has two channels to it. One channel for the actual results (the products/1 document) and another for all the includes (the categories/1 document).

The session knows how to read this included information and store it separately. When the Load<Category>("categories/1") call is made, we can retrieve that data directly from the session cache, without having to go to the server. This can save us quite a bit on the number of remote calls we make.
Includes aren't joins

It is tempting to think about Includes in RavenDB as similar to a join in a relational database. And there are similarities, but there are fundamental differences. A join will modify the shape of the output; it combines each matching row from one side with each matching row on the other, sometimes creating Cartesian Products that can cause night sweats for DBAs.

And the more complex your model, the more joins you'll have, the wider your result sets become, and the slower your application will become. In RavenDB, there is very little cost to adding includes. That is because they operate on a different channel than the results of the operation.

Includes are also important in queries, and there they operate after paging has been applied, instead of before paging, like joins.
session.Delete<Product>(1);
session.Delete("orders/1");

It is important to note that calling Delete doesn't actually delete the document. It merely marks that document as deleted in the session. It is only when SaveChanges is called that the document will be deleted.
3.3.2.4 Query

Querying is a large part of what a database does. Not surprisingly, queries strongly relate to indexes, and we'll talk about those extensively in Chapters 5 and 6. In the meantime, let us see how we can query using RavenDB.

List<Order> orders = (
    from o in session.Query<Order>()
    where o.Company == "companies/1"
    select o
).ToList();

RavenDB takes full advantage of LINQ support in C#. This allows us to express very natural queries on top of RavenDB in a strongly typed and safe manner.

Because we'll dedicate quite a bit of time to talking about queries and indexes later on, I'll be brief. Queries allow us to load documents that match a particular predicate. Like documents loaded via the Load call, documents that were loaded via a Query are managed by the session. Modifying them and calling SaveChanges will result in their update on the server.
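For example (ShippedAt here is a hypothetical property on Order, used purely for illustration):

// Queried entities are tracked by the session; no explicit Update() needed.
var order = session.Query<Order>()
    .First(o => o.Company == "companies/1");
order.ShippedAt = DateTime.UtcNow; // ShippedAt is a hypothetical property
session.SaveChanges(); // the modified order is sent back to the server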
And like the Load call, Query also supports include:

List<Order> orders = (
    from o in session.Query<Order>()
        .Include(x => x.Company)
    where o.Company == "companies/1"
    select o
).ToList();

You can now call Load<Company> on those companies, and they will be served directly from the session cache.
Queries in RavenDB don't behave like queries in a relational database. RavenDB does not allow computation during queries, and it doesn't have problems with table scans. We'll touch on exactly why, and the details about indexing, in Chapters 5 and 6, but for now you can see that most queries will just work for you.
So SaveChanges is called only once per session. In web scenarios, this is typically handled in the controller. Listing 3.6 shows examples of base RavenDB controllers for ASP.NET Web API and ASP.NET MVC. Both samples show a common pattern for working with RavenDB: we have the infrastructure (in this case, the base controller) take care of opening the session for us, as well as calling the SaveChanges method if there have been no errors.
Listing 3.6: Base controller classes for RavenDB

public abstract class BaseRavenDBController : Controller
{
    public IDocumentSession DocumentSession { get; set; }

    protected override void OnActionExecuting(
        ActionExecutingContext filterContext)
    {
        DocumentSession = DocumentStoreHolder.Store
            .OpenSession();
    }

    protected override void OnActionExecuted(
        ActionExecutedContext filterContext)
    {
        using (DocumentSession)
        {
            if (DocumentSession == null ||
                filterContext.Exception != null)
                return;
            DocumentSession.SaveChanges();
        }
    }
}

public abstract class BaseRavenDBApiController : ApiController
{
    public IAsyncDocumentSession DocumentSession { get; set; }

    public override async Task<HttpResponseMessage> ExecuteAsync(
        HttpControllerContext ctx,
        CancellationToken cancel)
    {
        using (var session = DocumentStoreHolder.Store.OpenAsyncSession())
        {
            var message = await base.ExecuteAsync(ctx, cancel);
            await session.SaveChangesAsync();
            return message;
        }
    }
}
with the low-level stuff on a regular basis. But when you do, this is where the database commands come into play.

RavenDB is exposed over the network using a REST API. And you can absolutely make use of REST calls directly. We have several customers that are using REST calls from PowerShell to administer RavenDB. That is fine, and works great, but usually we can do better.

The Database Commands expose a low-level API against RavenDB that is much nicer than raw REST calls. For example, I might want to check whether a potentially large document exists, without loading it. I can do that using:

var cmds = DocumentStoreHolder.Store.DatabaseCommands;
var docMetadata = cmds.Head("products/1");
if (docMetadata != null)
    Console.WriteLine("document exists");

You can use the Database Commands to get the database statistics, generate identity values, get the indexes and transformers on the server, issue patch commands, etc.

The reason that they are exposed to you is that the RavenDB API, at all levels, is built with the notion of layers. The expectation is that you'll usually work with the highest layer, the session API. But since we can't predict all things, we also provide access to the lower-level API, on top of which the session API is built, so you can fall down to that if you need to.

For the most part, that is very rarely needed, but it is good to know that this is available, just in case.
Esent stands for the Extensible Storage Engine (ESE), and it is also known as JET Blue. It is a core component in Windows, and forms the basis for services such as Active Directory, Exchange and many other Windows components. It is a robust and production-tested storage engine, and has been used in RavenDB since the very start.

Voron (Russian for Raven) is an independently developed storage engine that was created by Hibernating Rhinos. It takes a lot from LevelDB and LMDB, but its internal structure is quite different. It provides high-performance reads and writes and has full ACID support. Voron is our next-generation storage technology, and it lies at the core of several upcoming features for RavenDB.

When you create a new database with RavenDB, you have the option of selecting the storage engine. Before RavenDB 3.0, you only had Esent as the storage engine. Now you have to make a choice. Esent has more time in the field, and in general has been proven to be a pretty good choice. It does suffer from several issues, chief among them that you cannot easily move the database between machines. That is because Esent is tied to the Windows version, so you can't take a database from a Windows 2012 Server and open it on a Windows 8 machine. Another issue is that Esent is tied to the actual machine locale, and may require a defrag stage when moving between machines with different locales.

Voron, on the other hand, was built to avoid all those issues, and you can move it between machines with no issues. It also tends to be faster than Esent for most purposes. Voron is optimized for 64 bits, and while it can run on 32-bit systems, its database size is very limited in those scenarios. Voron also has a lot less real-world experience than Esent.

The conservative choice would be to go with Esent for the time being, even though Voron is what we are aiming at for the future. New features in RavenDB (distributed counters, event storage, etc.) are coming down the pipe that will be Voron only. And, of course, Hibernating Rhinos' own internal systems are running using Voron.
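If you want to make the choice from code rather than from the Studio, something along these lines should work; the Raven/StorageTypeName setting name and its values are stated from memory, so treat them as assumptions:

// Sketch: create a database backed by Voron. The setting name
// "Raven/StorageTypeName" (values "voron" or "esent") is an assumption.
DocumentStoreHolder.Store.DatabaseCommands.GlobalAdmin.CreateDatabase(
    new DatabaseDocument
    {
        Id = "Northwind",
        Settings =
        {
            { "Raven/StorageTypeName", "voron" },
            { "Raven/DataDir", "~/Databases/Northwind" }
        }
    });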
3.6 Summary

In this chapter, we talked about getting started with RavenDB. We installed RavenDB from scratch, then talked to it using the Document Store and Document Session. We have also explored the client API, what it can do and how to best utilize it.

You should have a single document store instance in your application, and use that to create a single session per request (or per message). We've also seen some sample code to handle that for common scenarios such as ASP.NET Web API and ASP.NET MVC. We covered how to do basic CRUD with RavenDB using the session and explored the layered structure of RavenDB.

The session is the highest layer of the Client API; then we have the database commands, and finally the raw REST calls over HTTP. That layered architecture is also present on the server side, and one such example is the storage engines that we looked at, Esent and Voron.

We touched briefly on running RavenDB in production, just to get you started, and we'll talk about this a lot more in Chapter 8, Operations.

In the next chapter, we'll talk more about various concepts inside RavenDB. We'll see how everything is put together, and how you can best take advantage of that.
Chapter 4
RavenDB concepts
We have a running instance of RavenDB, and we have already seen how we can put information into and get information out of our database. But we are still only just scratching the surface of what we need to know to make effective use of RavenDB. In this chapter, we'll go over the major concepts inside RavenDB.

The first step along the way is to understand what documents are.
independently. Instead, the order lines will be embedded inside the order. Thus, whenever we want to load the order, we'll get all of the order lines with it. And a modification to an order line (be it updating, removing or adding) is a modification to the order as a whole, as it should be.

The order line is now a Value Type, an object that only has meaning within its parent object, not independently. This has a lot of interesting implications. You don't have to worry about Coarse-Grained Locking1 or partial entity updates. Rules like "external references should only be to aggregates" are automatically enforced, simply because documents are aggregates, and they are the only thing you can reference.

Documents are independent and coherent. What do those mean? When designing the document structure, you should strive toward creating a document that can be understood in isolation. You should be able to perform operations on a document by loading that single document and operating on it alone. It is rare in RavenDB to need to reference additional documents during write operations. That is enough modeling for now; we'll continue talking about that in the next chapter. Now we are going to go beyond the single document scope, and look at what a collection of documents is.
4.2 Collections

On the face of it, it is pretty easy to explain collections. See Figure 1 as a good example.

It is tempting to think about collections as a set of documents that have the same structure and are stored in the same location. That is not the case, however. Two documents in the same collection can be utterly different from one another in their internal structure. See Figure 2 for one such example.

Because RavenDB is schemaless, there is no issue with doing this, and the database will accept and work with such documents with ease. This allows RavenDB to handle dynamic and user-generated content without any of the hard work that is usually associated with such datasets. It is pretty common to replace EAV2 systems with RavenDB, because it makes such systems very easy to build and use.

RavenDB stores all the documents in the same physical location, and the collection association is actually just a different metadata value. The Raven-Entity-Name metadata value controls which collection a particular

1 You might notice a lot of terms from the Domain Driven Design book used here; that is quite intentional. When we created RavenDB, we intentionally made sure that DDD applications would be a natural use case for RavenDB.
2 Entity-Attribute-Value schemas, the common way to handle dynamic data in relational databases. Also notorious for being hard to use, very expensive to query and in general a trouble area you don't want to go into.
document will belong to. Being a metadata value, it is something that is fully under your control.

Collections & document identifiers

It is common to have the collection name as part of the document id. So a document in the Products collection will have an id such as products/1. That is just a convention, though; you can have a document in the Products collection (because its metadata has the Raven-Entity-Name value set to "Products") while it has the id bluebell/butterfly.

RavenDB does use the collections information to optimize internal operations. Changing the collection once the document is created is not supported. If you need to do that, you'll need to delete the document and create it with the same id, but a different collection.

We've talked about the collection value in the metadata, but we haven't actually talked about what the metadata is. Let's talk meta.
4.3 Metadata

The document data is composed of whatever it is that you're storing in the document. For the order document, that would be the shipping details, the order lines, who the customer is, the order priority, etc. But you also need a place to store additional information, not related to the document itself, but about the document. This is where the metadata comes into play.

The metadata is also in JSON format, just like the document data itself. However, there are some limitations. The property names follow the HTTP headers convention of being Pascal-Cased. In other words, we separate words with a dash, and the first letter of each word is capitalized; everything else is in lower case. This is enforced by RavenDB.

RavenDB uses the metadata to store several pieces of information about the document that it keeps track of:

The collection name - stored in the Raven-Entity-Name metadata property.
The last modified date - stored in the Last-Modified metadata property3.
The client side type - stored in the Raven-Clr-Type metadata property.
The etag - stored in the @etag metadata property, and discussed at length later in this chapter.

3 This is actually stored twice, once as Last-Modified and once as Raven-Last-Modified; the first follows the RFC 2616 format and is only accurate to the second, while the second is accurate to the millisecond.
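Putting those together, the metadata of a typical order document might look roughly like this (all values are illustrative):

{
  "Raven-Entity-Name": "Orders",
  "Raven-Clr-Type": "Northwind.Models.Order, Northwind",
  "Last-Modified": "Wed, 01 Oct 2014 10:15:44 GMT",
  "Raven-Last-Modified": "2014-10-01T10:15:44.1234567",
  "@etag": "01000000-0000-0001-0000-000000000EB6"
}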
You can use the metadata to store your own values. For example, Last-Modified-By is a common metadata property that is added when you want to track who changed a document. From the client side, you can access the document metadata using the following code:

Product product = session.Load<Product>("products/1");
RavenJObject metadata = session.Advanced.GetMetadataFor(product);
metadata["Last-Modified-By"] = currentUser.Name;

It is important to note that there will be no extra call to the database to fetch the metadata. Whenever you load the document, the metadata is fetched as well. In fact, we usually need the metadata to materialize the document into an entity.
Changing a document's collection

RavenDB does not support changing collections. While it is possible to change the metadata value for Raven-Entity-Name, doing so is going to cause issues.

We have a lot of optimizations internally to avoid extra work based on the collection name, and no support whatsoever for changing it. We've tried to add support for this, or even to just flat out error when / if you try to change the collection name, but either option proved to be too expensive.

When you save a document, RavenDB can just throw the data onto the disk as fast as possible. Needing to check for a previous collection name has proven to be very expensive (needing to do a read per write) and hurt our performance, especially in bulk insert mode.

If you need to change a document's collection, the supported way to do that is to delete it and then save it again, with the same document id.
Once you have the metadata, you can modify it as you wish, as seen in the last line of code. The session tracks changes to both the document and its metadata, and changes to either one of those will cause the document to be updated on the server once SaveChanges has been called.

Modifying the metadata in this fashion is possible, but it is pretty rare to do so explicitly in your code. Instead, you'll usually use listeners to do this sort of work.
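For example, a store listener can stamp every document on its way to the server. Here is a rough sketch; the exact IDocumentStoreListener signature is from memory, so treat the details as assumptions:

// Sketch: stamp Last-Modified-By on every stored document.
// The exact IDocumentStoreListener signature is an assumption.
public class AuditStoreListener : IDocumentStoreListener
{
    public bool BeforeStore(string key, object entityInstance,
        RavenJObject metadata, RavenJObject original)
    {
        metadata["Last-Modified-By"] = Thread.CurrentPrincipal.Identity.Name;
        return true; // true indicates we modified the metadata
    }

    public void AfterStore(string key, object entityInstance,
        RavenJObject metadata)
    {
    }
}

You would register it once, at document store creation time, with something like documentStore.RegisterListener(new AuditStoreListener()).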
Unlike a primary key, which is unique per table, all the documents in a database share the same key space4.

Identifiers terminology

Document identifiers are also called document keys, or just ids or keys. In the nomenclature of RavenDB, we use both keys and ids to refer to a document id.
This approach has several advantages. It tends to generate small and sequential keys, and most importantly, these types of keys are human-readable and easily understood.

The question now is, how do we get this numeric suffix?
Raven/Hilo/categories
Raven/Hilo/companies
Raven/Hilo/employees
Raven/Hilo/orders
Raven/Hilo/products
Raven/Hilo/regions
Raven/Hilo/shippers
Raven/Hilo/suppliers
Those are pretty trivial documents; they all have just a single property, Max. That property's value is the maximum possible number that has been generated (or will be generated) for that collection. When we need to generate a new identifier for a particular collection, we fetch that document and get the current max value. We then add to that max value and update the document.
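For illustration, the entire contents of a Raven/Hilo/products document might be just (the number is made up):

{ "Max": 32 }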
We now have a range, between the old max value and the updated max value. Within this range, we are free to generate identifiers with the assurance that no one else can generate those same identifiers.

The benefit of this approach is that it also generates roughly sequential keys, even in the presence of multiple clients generating identifiers concurrently.
document back after raising the Max value, we do so in a way that will throw a ConcurrencyException if the document has been changed in the meantime. If we get this error, we retry the entire process from the beginning: fetching the document again, recording the current Max value, then saving with the new Max.

This way, we are protected against multiple clients overwriting one another's changes and generating duplicate ids.
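A minimal sketch of that retry loop, assuming a hypothetical HiloDoc helper class with a single long Max property; error handling is trimmed:

// Sketch: reserve a hilo range using optimistic concurrency.
// HiloDoc { public long Max; } is a hypothetical helper class.
private static Tuple<long, long> ReserveRange(
    IDocumentStore store, string collection, long rangeSize)
{
    while (true)
    {
        using (var session = store.OpenSession())
        {
            session.Advanced.UseOptimisticConcurrency = true;
            var hilo = session.Load<HiloDoc>("Raven/Hilo/" + collection)
                       ?? new HiloDoc();
            long oldMax = hilo.Max;
            hilo.Max = oldMax + rangeSize;
            session.Store(hilo, "Raven/Hilo/" + collection);
            try
            {
                session.SaveChanges(); // throws if someone beat us to it
                // ids in (oldMax, hilo.Max] are now ours to hand out
                return Tuple.Create(oldMax + 1, hilo.Max);
            }
            catch (ConcurrencyException)
            {
                // another client reserved a range first; retry
            }
        }
    }
}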
4.4.3.3 Distributed hilo

Using the hilo algorithm, we only have to go back to the database once we run out of ids in the range we reserved. But what happens when we cannot contact that database? We'll touch on the distribution model later on, in Part 3, Scale Out, but I do want to expand on how it relates to hilo at this time.

Assume that we have a RavenDB cluster made of 3 nodes. We will configure each node to have its own unique hilo prefix. This way, if the primary node is down, we can still reserve ranges, and we don't have to worry about reserving the same range as another client because of a network failover.

We'll discuss such scenarios extensively in Part 3. For now, all you really care about is that you can use the hilo system in a cluster without worrying about a single point of failure.
4.4.3.4 Manual hilo

You don't need to do anything special to use the hilo algorithm. It is what the RavenDB Client API does by default. It generates ids that have the following format:

orders/13823
products/7371
PackageTracking/38248225

But sometimes you want to just have the numeric id and work with that. Maybe you are working with internal ids (see Chapter 5, Modeling) or using semantic ids (see a bit later in this chapter), but for whatever reason, you want to be able to generate those hilo values yourself.

You don't need to start implementing everything from scratch. You can just write the following code:
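Something along these lines; the HiLoKeyGenerator type and its NextId method are my recollection of the client API, so treat the exact names as assumptions:

// Sketch: reuse the client's hilo machinery to get raw numeric values.
// HiLoKeyGenerator / NextId are assumed names from the client API.
var generator = new HiLoKeyGenerator("products", capacity: 32);
long nextId = generator.NextId(DocumentStoreHolder.Store.DatabaseCommands);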
5 When the entity name is composed of a single word, we'll default to lower-casing it; when it is composed of multiple words, we'll preserve the casing.
4.4.4 Identity

If you really need consecutive ids, you can use the identity option. An identity, just like in a relational database (where it is sometimes called a sequence), is a simple, always-incrementing value. Unlike the hilo option, you always have to go to the server to generate such a value.

There are two ways to generate an identity value. The first is to do so implicitly, as shown in Listing 4.1.

Listing 4.1: Using implicit identity

using (var session = documentStore.OpenSession())
{
    var product = new Product
    {
        Id = "products/",
        Name = "What's my id?"
    };
    session.Store(product);
    Console.WriteLine(product.Id);
    // output: products/
    session.SaveChanges();
    Console.WriteLine(product.Id);
    // output: products/78
}
You can see that we actually define an id for the product. But a document id that ends with a slash (/) isn't allowed in RavenDB. We treat such an id as an indication that we need to generate an identity. That has an interesting implication.

How are identities stored?

Identities are stored as a list of tuples containing the identity key and the last value. This data is persistent. In other words, if you delete the latest document, you won't get the same id back.

As a result of that, identities are actually created lazily, the first time we need them. The first value generated is always one, and there is no way to set a different step size for identities. This raises the question: what happens if we create a document with the id products/1 manually, then try to save a document with the id products/?

RavenDB is smart enough to recognize this scenario, and it will generate a non-colliding id in an efficient manner. In this case, we'll get the id products/2.
We don't go to the server until we actually call SaveChanges. That means that we don't know what the actual document id is until after we've already gotten the reply from the server. That isn't fun, but on the other hand, we can save multiple documents using identity without having to go to the server for each of them individually.

The other way to use identities is to do so explicitly. You can do that using the following code:

long nextIdentity = documentStore.DatabaseCommands
    .NextIdentityFor("invoices");

This allows you to construct the full document id on the client side. But it does require two trips to the database: one to fetch the identity value and a second to actually save it. There is no way to get multiple identity values in a single request.

You can set the identity's next value using this command:

long nextIdentity = documentStore.DatabaseCommands
    .SeedIdentityFor("invoices", 654);
Invoices, and other tax annoyances

For the most part, unless you are using semantic ids (covered later in this chapter), you shouldn't care what your document id is. The one case where you do care is when you have an outside requirement to generate absolutely consecutive ids. One such common case is when you need to generate invoices.

Most tax authorities have rules about not missing invoice numbers, to make it just a tad easier to actually audit your system. But an invoice document's identifier and the invoice number are two very different things.

It is entirely possible to have the document id of invoices/843 for invoice number 523.
4.4.4.1 The downsides of identity

There is no such thing as a free lunch, and identity also has its own set of drawbacks. Chief among them is that identities are actually not stored as a document. Instead, they are stored internally, in a way that isn't quite so friendly.

That means that exporting and importing the database will not also carry over the identity values. The identity values are also not replicated, so identity isn't suitable for use in a cluster.

Finally, modifying an identity happens in a separate transaction from the current transaction. In other words, if we try to save a document with the id products/ and the transaction fails, the identity value is still incremented. So even though identity generates consecutive numbers, it might still skip ids if a transaction has been rolled back.

Except for very specific requirements, such as an actual legal obligation to generate consecutive numbers, I would strongly recommend not using identity. Note my wording here: a legal obligation, which doesn't arise just because someone wants consecutive ids because they are easier to grasp. Identity has a real cost associated with it.
And that is quite enough about document identifiers. We'll now move on to the other crucial piece of information that every document has: the etag.

4.5 ETags

An etag in RavenDB is a 128-bit number that is associated with a document. Whenever a document is created or updated, an etag is assigned to that document. The etags are always incrementing, and they are heavily used inside RavenDB. Among their usages:

An etag is associated with the document metadata, so whenever you load the document, you also have the etag available to you. Retrieving the etag is easy; all you have to do is:

Etag productEtag = session.Advanced.GetEtagFor(product);

On the client side, etags are used for optimistic concurrency control and for caching. We'll touch on optimistic concurrency in the next section, and caching is the section after that. I want to focus on how we are using etags on the server side.
The structure of an ETag

There is just one promise that we make about etags, and that promise is that they are always incrementing. Anything else is an implementation detail. That said, it can be an interesting implementation detail. Let us take a look at an etag: 01000000000000010000000000000EB6.

This looks like a Guid, and indeed, this is a 128-bit number, which is using the Guid format convention because it is convenient. It is actually composed of the following parts:

(1) 01
(2) 000000-0000-0001
(3) 0000-000000000EB6
session will save any updates to it with optimistic concurrency13. There aren't many scenarios where you actually want to have Last Write Wins for one particular document and optimistic concurrency for another. This feature was created to enable end-to-end optimistic concurrency:

Load the document and send it to John.
Load the document and send it to Martha.
Get the document back from John, update it per John's instructions and save it.
Get the document back from Martha, update it per Martha's instructions and save it.

The problem is in how we define what changed. In the scenario above, optimistic concurrency is within the scope of a single session. In Martha's update case, we loaded the document (which already had John's update) and then saved it with Martha's update. There is no problem as far as the database is concerned. The problem is that Martha never saw John's update, and as far as John and Martha are concerned, this is just another case of Last Write Wins.

In order to handle this scenario from an end-to-end perspective, we need to send the client the etag of the document as well as the actual document itself. And when we request an update to the document, we'll need to pass along the original etag value. Listing 4.2 shows the server-side code for end-to-end concurrency using ASP.NET MVC.
Listing 4.2: End-to-end optimistic concurrency in RavenDB

public ActionResult LoadProduct(int id)
{
    Product product = DocumentSession.Load<Product>(id);
    Etag etag = DocumentSession.Advanced.GetEtagFor(product);
    return Json(new
    {
        Document = product,
        Etag = etag.ToString()
    });
}

public ActionResult EditProduct(Product product, string etag)
{
    Etag originalEtag = Etag.Parse(etag);
    DocumentSession.Store(product, originalEtag);
    return Json(new { Saved = true });
}

13 Note that if the document was not changed in the session, calling SaveChanges will not throw, even if the document was modified server side.
We are sending the etag along with the document, and we get the original etag back with the document. If the document has been changed between the LoadProduct and EditProduct requests, we'll detect it and can show an error to the user.

By now we've seen that etags are important for indexing, replication and optimistic concurrency. There is another area where etags have a big role: caching. Let us see how that works in RavenDB.
4.7 Caching
Caching is the very first tool that we use when we want to improve any system's performance. In the context of a database, there is actually a lot of caching involved. We precompute things and store them on disk for later, we cache indexes and documents in memory so we won't have to go to disk, and we also handle caching on the client side.
You don't actually care about the server side caching; that is an implementation detail. You certainly benefit from it, but it has no impact on your day to day operations. The client stuff, however, is very relevant. The first cache you'll encounter in RavenDB is the session cache:
var p1 = DocumentSession.Query<Product>()
    .Where(p => p.Name == "Chef Anton's Gumbo Mix")
    .First(); // returns products/5

var p2 = DocumentSession.Load<Product>("products/5");
Here, we are only going to go to the database once. We first query the database for a product with the specified name, then we load a document by id. Since that document was already loaded into the session by the query, we can skip going to the server entirely. This is a very short lived cache, and it isn't actually there for performance. The main reason we have this behavior is so the session will implement the Identity Map pattern. Because the lifespan of the session is very short, you don't get a lot of utility out of such a cache, but features such as include help make it a very important optimization technique.
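To see the Identity Map behavior concretely, here is a short sketch (the document id is illustrative):

using (var session = documentStore.OpenSession())
{
    var a = session.Load<Product>("products/5"); // goes to the server
    var b = session.Load<Product>("products/5"); // served from the session cache
    Console.WriteLine(ReferenceEquals(a, b));    // True: the very same instance
}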
Usually this is where you start using a cache provider to cache the results of queries, so you can avoid making a remote call to the database in the first place. However, RavenDB is a full service database, and we see no reason why you should have to write caching code.
Hand rolled / ad hoc caching
Caching should be a pretty simple process:
- Check the cache; if the value is there, return it.
- Otherwise, load it, put it in the cache and return it.
For something that is supposed to be so simple, it is actually really complex when you go into the details. If the objects that you hold in the cache are mutable, then you can't actually return the same instance from the cache in multiple calls; you open yourself up for race conditions by having two threads fetch the same instance and modify it concurrently.
Cache misses are also pretty complex. What happens if you have two threads that have a cache miss at the same time on the same item? Do both of them fetch the value (increasing load on the database) or just one of them? Cache invalidation is a topic that requires you to juggle multiple competing concerns (liveliness, performance, data staleness, and more).
Because of all those factors, caching code has the following properties:
- Boring
- Multi-threaded
- Repeatable
- Performance critical
- Quite tricky to get right
Combine all those properties together and you can probably see why writing caching code isn't something you want to do.
RavenDB exposes a REST interface to the outside world, and the nice thing about REST and HTTP is that a lot of work has already been put into thinking about caching. RavenDB took full advantage of all of that work, and the RavenDB HTTP cache was born.
4.7.1 HTTP Caching
The server checks whether or not the request's etag has changed, specifically so the common case of "no, whatever you have in the cache is fine" will be very fast.
Customizing the cache
It isn't common to need to modify the RavenDB HTTP cache, but we don't believe in utterly blocking our users, so there is actually quite a lot of control that you can assert over this process.
The cache is only active for GET requests; all other requests are ignored. And by default we cache all of them. A cache that doesn't have an eviction policy isn't a cache, it is a memory leak, and in this case, the RavenDB cache uses the Least Recently Used algorithm, with a cap of 2,048 cached requests. You can change that by setting the documentStore.MaxNumberOfCachedRequests property.
A cache trades off memory for time, so the higher the number of cached requests, the more memory the cache will use, and the more your application can serve from the cache. One scenario where this is problematic is when you have many requests that return a large number of big documents. The result is that the cache is filled with a lot of data, and that might cause issues.
You can fine-tune what goes into the cache or not by using the documentStore.Conventions.ShouldCacheRequest event.
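A minimal sketch of both knobs together (the values and the URL filter are illustrative; the exact shape of ShouldCacheRequest varies between client versions, here it is assumed to receive the request URL):

documentStore.MaxNumberOfCachedRequests = 4096; // raise the LRU cap from 2,048

// skip caching for requests we know return large results (hypothetical filter)
documentStore.Conventions.ShouldCacheRequest =
    url => !url.Contains("/big-reports/");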
When you make a request with If-None-Match and the information has changed, we just process the request normally, and send the results to you along with the new etag. The end result is a system that is very fast; for the common case you only need to check with the server if something has changed, and you save all the computation and bandwidth costs. At the same time, you don't have to worry about cache invalidation or displaying out of date results, because we confirm the accuracy of the results by checking with the server.
Having both performance and safety is great; that is why we have made this the default approach in RavenDB. But there is just one niggling issue still remaining. We have to go to the database to check if our information is still up to date. We can save the computational and bandwidth costs, but the latency of going to the database is usually the most expensive part. That is why we have the next level of caching.
4.7.2 Aggressive Caching
Aggressive caching is an opt-in feature, and using it takes your caching to the next level. Let us see the code in Listing 4.3 first, then discuss it.
Listing 4.3: Using aggressive caching
using (documentStore.AggressivelyCache())
{
    for (int i = 0; i < 10; i++)
    {
        using (var session = documentStore.OpenSession())
        {
            var product = session.Load<Product>("products/1");
            Console.WriteLine(product.Name);
        }
        Console.ReadLine();
    }
}
Looking at the Listing 4.3 code, how many requests would you expect? Without the AggressivelyCache call, we would expect there to be 10 requests. But with it? With AggressivelyCache, we are only going to make a single request to the server. Inside the aggressive cache scope, if we have a request in the cache, we don't even check with the server whether it is up to date or not. That means that we can process pretty much whatever we want in memory, without ever having to make a single remote call.
You really can't get any faster than serving directly from your own local memory. Of course, there is a downside: because we never check with the server, someone might go and change the document on the server side, and we'll miss this update. Now you can probably see why we are calling this aggressive caching. Except that isn't how it really works.
Let us set up an experiment (I suggest looking at the RavenDB console, which shows a list of all the requests made to the server; that can help you see when we are actually requesting data from the server and when we are serving directly from the cache). Put the code in Listing 4.3 in a console application, and hit enter a time or two. Then go and change the product name, and hit enter again. Because we are using aggressive caching, we aren't actually going to go to the server and check that we have the latest version, so we'll expect to see the same output. Instead, I got this:
Chai
Chai
Latte
We have aggressive caching enabled, and we didn't check with the server, but we got the latest data anyway. How does that work?
Instead of having the RavenDB Client API check all the time whether the cached information is up to date (polling), we reverse the flow (pushing). The client asks the database to let it know if there have been any changes. As long as it didn't get a change notification from the database, the client can safely serve directly from the cache, with a high degree of confidence that the information it gives you is up to date. And when it does get a change notification, it merely needs to use the default caching route, where we go to the server to check if our results are actually up to date or not.
Cache complexity
It is easy to think that aggressive caching is going to use the details in the change notifications to selectively invalidate parts of the cache. However, that isn't the case (I originally spelled this "that isn't the cache", but decided that it was a bad pun). The HTTP cache is pretty ignorant of the way RavenDB works, and even if we tried adding that knowledge to it, it would be very hard to handle.
Consider for example the case of changing a specific product document. It is easy to see that the url for loading that document should be invalidated. But what about a url for documents by name? Should it be invalidated? What about the url for an order that has an include for that product?
It is impossible to try to answer those questions without doing so much work that the benefits of the cache would be nil, so we don't. Instead we use a far simpler route: a change notification invalidates the entire cache.
Setting up aggressive caching also sets up the change notifications subscription, and getting a change notification will cause the entire cache to be invalidated. That sounds scary, and far too aggressive (I couldn't resist the pun this time). Any change? The whole cache? That has got to pretty much kill this feature for real world purposes, right? But we aren't clearing the cache; most of the data we already have in the cache is still going to be up to date. The way it works, each cache entry has a timestamp, which marks when it was fetched from the database. And we track the time of the last change notification from the server. When the last change notification from the server is later than the time the cache entry was fetched, we have to go to the database to make sure that this entry is still consistent.
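A conceptual sketch of that freshness check (not RavenDB's actual implementation; the types and names here are invented for illustration):

using System;

class CacheEntry
{
    public DateTime FetchedAt; // when this entry was fetched from the server
    public string Etag;        // the etag we got along with it
    public byte[] Response;
}

class AggressiveCache
{
    private DateTime lastChangeNotification = DateTime.MinValue;

    // called by the change notifications subscription
    public void OnServerChange()
    {
        lastChangeNotification = DateTime.UtcNow;
    }

    // serve straight from memory only if nothing changed since we fetched
    // the entry; otherwise fall back to the normal etag check with the server
    public bool CanServeWithoutCheck(CacheEntry entry)
    {
        return entry.FetchedAt > lastChangeNotification;
    }
}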
Because of this behavior, aggressive caching is almost perfect. You can usually serve the data directly from your own local cache, without having to make any remote calls, and at the same time, you are notified of any changes and can check with the server very quickly. Note that very quickly does not mean immediately. While the actual latency between a document being changed and the client
being notified about this change is very short (on the order of milliseconds in the common case), it isn't zero.
This means that it is possible for you to load the document from the cache even though it was changed 2 milliseconds ago. This violates one of the basic tenets of RavenDB, that access to the documents is fully ACID and immediately consistent. That is why you need to explicitly ask for this feature. I wholeheartedly recommend taking advantage of it, but you need to consider what aspects of your code can accept potentially cached results, and what requires full consistency.
4.8 Summary
Phew! This has been a long chapter. We covered a lot of the basic concepts in RavenDB, from Entities & Aggregate Roots to collections and metadata. We then started to dive into deeper integration of your application with RavenDB.
On to identifiers: what not to do (Guids) and the various choices that you have with identifiers. Most importantly, in my opinion, identifiers should be human readable, because they are a key part of how you work with your system. The RavenDB options we have (hilo, identity and semantic ids) all follow the same principle. They are workable for humans first, and machines later.
After covering ids in great detail, we went on to etags, how they are composed and what we do with them. We looked at the server side use of etags in indexing and replication, and at the client side use of etags with optimistic concurrency (including end to end optimistic concurrency).
Finally, we looked at caching, and saw that on the client side, RavenDB has three different caching options. The session cache, which is primarily used for the Identity Map. The HTTP cache, which uses etags and the If-None-Match header to check if the information has changed server side. Even better, we have Aggressive Caching, which can skip going to the server entirely, and can use change notifications to decide when to invalidate the cache.
In the next chapter, we'll continue to dive deeper into the client API. We'll talk about advanced features such as streaming results, bulk insert, subscribing to database changes and using partial document updates. After that, we'll move on to part two: indexing.
Chapter 5
which tells RavenDB that we'll want the associated documents as well as the one we asked for. But what is probably the most interesting way to reduce the number of remote calls is to be lazy. Let us first look at Listing 5.1 and then we'll discuss what is going on there.
Listing 5.1: Using Lazy Operations
var lazyOrder = DocumentSession.Advanced.Lazily
    .Include<Order>(x => x.Company)
    .Load("orders/1");

var lazyProducts = DocumentSession.Query<Product>()
    .Where(x => x.Category == "categories/2")
    .Lazily();

DocumentSession.Advanced
    .Eagerly.ExecuteAllPendingLazyOperations();

var order = lazyOrder.Value;
var products = lazyProducts.Value;
var company = DocumentSession.Load<Company>(order.Company);

// show order, company & products to the user
In Listing 5.1, instead of writing DocumentSession.Include(...).Load(...), we used the Advanced.Lazily option, and instead of ending the query with a ToList we used the Lazily extension method. That much is obvious, but what does this mean? When we use lazy operations, we aren't actually executing the operation. We merely register that we want it to happen. It is only when Eagerly.ExecuteAllPendingLazyOperations is called that those operations are executed, and they all happen in one round trip.
Consider the case of a waiter in a restaurant. When the waiter is taking the order from a group of people, it is possible for him to go to the kitchen and let the cook know about a new Fish & Chips plate whenever a member of the group makes an order. But it is far more efficient to wait until every member of the group has ordered, and then go to the kitchen once.
There are 880 orders in the Northwind database. How many results will this query return? As you can imagine, the answer is not 880. Code like this snippet is bad, because it makes assumptions about the size of the data. What would happen if there were three million orders in the database? Would we really want to load and materialize them all? This problem is called an Unbounded Result Set, and it is very common in production, because it sneaks up on you. You start out your system, and everything is fast and fine. Over time, more and more data is added, and you end up with a system that slows down. In most cases I've seen, just after reading all of the orders, we discarded 99.5% of them and just showed the user the last 15.
With RavenDB, such a thing is not possible. If you don't specify otherwise, all queries are assumed to have a .Take(128) on them. So the answer to my previous question is that the snippet above would result in 128 order documents being returned. Some people are quite upset with this design decision; their argument boils down to: "This code might kill my system, but I want it to behave like I expect it to." I'm not sure why they expect to kill their system, but they can do that without RavenDB's help. That way, hopefully it wouldn't be us that would get the 2 AM wakeup call and try to resolve what is going on.
Naturally, a developer's first response to hearing about the default .Take(128) clause is this:

var orders = DocumentSession.Query<Order>()
    .Take(int.MaxValue) // fix RavenDB bug
    .ToList();
This is why we have an additional limit. You can specify a take clause up to 1024 in size. Any value greater than 1024 will be interpreted as 1024. Now, if you really want, you can change that by specifying the Raven/MaxPageSize configuration option, but we very strongly recommend against that. RavenDB is designed for OLTP scenarios, and there are really very few situations where you want to read a lot of data to process a user's request.
For the situations where you actually do need all the data, the Query API isn't really a suitable interface. That is why we have result streaming in RavenDB.
This way, the client can consume the data at a reasonable rate, and we reduce the overall load on the entire system.
But there is one relatively common case where you do need to have access to the entire dataset: Excel.
More properly, any time that you need to give the user access to a whole lot of records to be processed offline. Usually you output things in a format that Excel can understand, so the users can work with the data in a really nice tool. But reports in general are a very common scenario for this requirement.
So how can we do that? One way to handle it would be to just page through the data, something like the abomination in Listing 5.2:
Listing 5.2: The WRONG way to get all users
public List<User> GetAllUsers()
{
    // This code is an example of how NOT
    // to do things, do NOT try to use it
    // ever. I mean it!
    List<User> allUsers = new List<User>();
    int start = 0;
    while (true)
    {
        using (var session = DocumentStoreHolder.OpenSession())
        {
            var current = session.Query<User>()
                .Take(1024)
                .Skip(start)
                .ToList();
            if (current.Count == 0)
                break;
            start += current.Count;
            allUsers.AddRange(current);
        }
    }
    return allUsers;
}
I call this code an abomination because it has quite a few problems. To start with, it will work just fine on a small amount of data, but as the data size grows, it will do more and more work, and consume more and more resources. Let us assume that we have a moderate number of users, 100,000 or so. The cost of the code in Listing 5.2 is: go to the server 98 times, do deeper paging on each request, and hold the entire 100,000 users in memory.
Note that the code there is also evil because it uses a different session per loop iteration, preventing RavenDB from detecting the fact that this code is going to hammer the database with requests. Another problem is: when can we start processing the results? The code buffers all of them, so we have to wait until the entire process is done before we can start handling anything. All of those mean longer duration, higher memory usage and a lot of waste.
And what happens if someone is adding or deleting documents while this code is running? We don't make a single request, so the reads happen in multiple transactions, which raises consistency issues as well. In short, never write code like this. RavenDB has built-in support for properly handling large numbers of results, and it is intentionally modeled to be efficient at that scale. Say hello to streaming. Listing 5.3 is the same GetAllUsers method, now written properly.
Listing 5.3: The proper way to get all users
public IEnumerable<User> GetAllUsers()
{
    var allUsers = DocumentSession.Query<User>();
    IEnumerator<StreamResult<User>> stream =
        DocumentSession.Advanced.Stream(allUsers);
    while (stream.MoveNext())
    {
        yield return stream.Current.Document;
    }
}
Besides the fact that there is a lot less code here, let us take a look at what this code does. Instead of sending multiple queries to the server, we are making a single query. We also indicate that this is a streaming query, which means that the RavenDB Client API will have very different behavior, and will use a different endpoint for processing this query.
Paging and streaming
By default, a stream will fetch as many results as you need (up to 2.1 billion or so), but you can also apply all the normal paging rules to a stream. Just add a .Take(10 * 1000) to the query before you pass it to the Stream method.
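For example, a short sketch of the sidebar's advice:

var pagedQuery = DocumentSession.Query<User>()
    .Take(10 * 1000); // paging is set on the query itself

var stream = DocumentSession.Advanced.Stream(pagedQuery);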
On the client side, the differences are:
- The use of an enumerator to immediately expose the results as they stream in, instead of waiting for all of them to arrive before giving anything back to your code.
- The results of the Stream operation are not tracked by the session. There can be a lot of results, and tracking them would put a lot of memory pressure on your system. It is also very rare to call SaveChanges on a session that is taking part in a streaming operation, so we don't lose anything.
The dedicated endpoint on the server side has different behavior as well. Obviously, it does not apply the Raven/MaxPageSize limit. But much more importantly, it will stream the results to you without doing any buffering, so the client can start processing the results before the server has finished sending them. Another benefit is consistency: throughout the entire streaming operation, we are going to be running under a single transaction.
What happens is that we operate on a database snapshot, so any additions, deletions or modifications are simply not visible. As far as the streaming operation is concerned, the database is frozen at the time the streaming operation began.
RavenDB Excel Integration
I mentioned earlier that a very common use case for streaming is the need to expose the database data to users in Excel format. Because this is such a common scenario, RavenDB comes with dedicated support for it. The output of the Stream operation is usually JSON, but we can ask RavenDB to output it in CSV format, which can be readily consumed in Excel. You can go to the Excel Integration documentation page to see the walkthrough.
The nice thing about that is that you can even refresh the data from the database into Excel after the first import.
That said, please read the Reporting Chapter for a fuller discussion on how to handle reporting with RavenDB. In general, just giving users access to the raw data in your database results in a mess down the road.
The Stream operation accepts a query, or a document id prefix, or even just the latest etag that you have (which allows you to read all documents in update order). This is the usual way you're going to fetch large amounts of data from RavenDB.
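A sketch of streaming by document id prefix (the exact overloads vary between client versions, so treat this as illustrative):

using (var session = documentStore.OpenSession())
{
    // stream all documents whose id starts with "users/"
    var stream = session.Advanced.Stream<User>("users/");
    while (stream.MoveNext())
    {
        Console.WriteLine(stream.Current.Key);
    }
}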
But what if I want to go the other way around? What if I want to save a lot of data into RavenDB? That is what the bulk insert operation is for.
Listing 5.4 is good because we'll only have to go to the database once, right? We call SaveChanges, and things are saved in an optimal fashion. The answer to that is a definite maybe. The problem with giving a good answer is that we are missing a very important piece of information: how big is the users.csv file? If it is in the range of hundreds to low thousands of users, that is a great way to handle things. If it is more than that, we have a problem. The session isn't really meant for bulk operations. Let us assume that the users.csv file contains 50,000 users. That would mean that we would load 50,000 users into memory, then create a single request with a payload of 50,000 documents in it, then send it to the server as a single transaction.
The likely result is that we'll get an out of memory exception at some point along the way. In particular, there is a limit to how much data can be changed in a single transaction (it is controlled via the Raven/Esent/MaxVerPages option in Esent, and by the Raven/Voron/MaxScratchBufferSize option in Voron). Admittedly, this limit is in the hundreds of megabytes, but it is there. A very long write transaction is also something that we would like to avoid, because it is pretty costly.
A common solution is to use multiple sessions: we'll batch up to 512 users, then call SaveChanges, then create a new session. The downside of this approach is that we still have relatively large requests, and now we have a lot of them.
Trying this option with 50,000 users, we get the code in Listing 5.5. This is quick & dirty code, written merely to make a point.
Listing 5.5: Inefficiently insert new users, second option
int amount = 0;
var session = documentStore.OpenSession();
foreach (var i in Enumerable.Range(0, 50 * 1000))
{
    User user = new User
    {
        Name = "Hello " + i
    };
    session.Store(user);
    if (amount++ > 512)
    {
        amount = 0;
        session.SaveChanges();
        session.Dispose();
        session = documentStore.OpenSession();
    }
}
session.SaveChanges();
session.Dispose();
This code runs in 21.65 seconds, a rate of 2,310 documents per second. That is not great, but it isn't bad either. Trying a batch size of 1024 resulted in a run time of just over 25 seconds, and a batch size of 256 took 26 seconds, so for this workload, 512 seems to be working fine.
Such tasks aren't frequent in RavenDB. It isn't often that you need to put so much data into the database. However, we have to deal with yet another common case: the nightly ETL process. Every night we get a file from some other system that we need to load into our system. That means that we want to be able to give you a good solution for this issue.
Just telling you to do batched SaveChanges isn't enough. Hence, the need for bulk insert. Bulk insert operates in the exact opposite manner from streaming. Let us see the code in Listing 5.6 and then we'll discuss what is going on.
Listing 5.6: Efficiently insert new users using bulk insert
using (var bulkInsert = documentStore.BulkInsert())
{
    foreach (var i in Enumerable.Range(0, 50 * 1000))
    {
        User user = new User
        {
            Name = "Hello " + i
        };
        bulkInsert.Store(user);
    }
}
As you can see, we don't do batching in Listing 5.6. We create a single BulkInsert and use that. What actually happens is that the BulkInsert is using a single long request to talk to the server. Whenever bulkInsert.Store is called, we send that data to the server immediately. On the server side, it is also processed immediately, instead of having to wait for everything to get to the server.
Bulk Insert Batches
Internally, we don't actually send each document over the network independently. We have a batch size & time limit: we batch all the documents until we hit one of those limits, then send the whole batch to the server in one shot.
More than anything else, the fact that we can parallelize client and server work means that we get really good performance. It doesn't hurt that the code path BulkInsert uses is also highly optimized for inserts, as you can imagine.
BulkInsert can also accept an options argument:

documentStore.BulkInsert(options: new BulkInsertOptions());
Those options include:
- BatchSize - How many documents we should wait for before sending a batch to the server. The default is 512.
- OverwriteExisting - If set to true, allows RavenDB to overwrite an existing document; otherwise, an error is thrown if the inserted document already exists. The default is false; setting this to true will reduce the insert speed.
- CheckReferencesInIndexes - Whether document references (resulting from LoadDocument) need to be checked. We'll discuss this feature in detail in Part II. The default is false; setting this to true will reduce the insert speed.
- WriteTimeoutMilliseconds - How long we can wait for the full queue to clear. The default is 15,000 ms (15 seconds).
The last option deserves some additional explanation. One issue with bulk insert is that the client side can usually generate the documents far faster than the server can receive and store them. If the number of documents is high, it is possible that the number of documents waiting to be sent to the server will grow very large. Eventually, you'll run out of memory.
Because of that, the BatchSize option also controls how many documents can be waiting to be sent. The queued documents can be up to 150% of the BatchSize. Assuming the BatchSize is set to 512, the maximum number of pending documents will be 768. At this point, if you try to Store an
additional document, you'll be blocked until another batch has been processed and space in the queue is freed.
To prevent a situation where you're blocked for a very long time, we use the WriteTimeoutMilliseconds value to make sure that a timeout exception is thrown if we are waiting for too long.
At the end of the day, inserting 50,000 documents with BulkInsert took 11.33 seconds, a rate of 4,415 documents per second. That is about twice as fast as the alternative.
Benchmark & lies
A note about the performance numbers: I'm not trying to do a benchmark here for absolutely the best performance. I'm actually running this with a debug build of RavenDB on both the client & server, while the system is also running integration tests in the background, on a laptop, while riding the train. Don't trust those numbers!
We can get to 15,000 - 20,000 sustained writes per second on standard server hardware, using a release build, and we can go beyond that by configuring the database settings properly (we'll discuss those options in Part IV - Operations).
What I'm trying to do is give you a sense of the relative performance differences between the session and bulk insert. And in general, bulk insert is going to be between two and ten times as fast as using SaveChanges and batching.
The last, but certainly not least, important thing about BulkInsert: if you look at Listing 5.6, you can see that it is wrapped in a using statement. This is important. The BulkInsert uses the Dispose call to flush the remaining data to the server, close the connection and in general clean up after itself. It is only after the Dispose has completed that you can be certain that all the data that was bulk inserted is actually safe inside the database.
So far we talked about the big stuff: how we can read a lot of data and write a lot of data. Now I want to turn in the complete opposite direction and go into the very small. How can I update just a part of a document?
But there are good reasons why you'll want to do just a partial document update (although they are also pretty rare situations; for the most part, patching should be the exception). There are two common reasons to want to do that:
- You have a piece of data that has a good, legitimate reason to be changed concurrently.
- You want to save the cost of loading the full document and saving the full document.
Both reasons are somewhat problematic. Because a document is a unit of change, there should be only a single reason to change it, and all updates to a document are serialized. Reasons for wanting to change a document concurrently should be rare. One such example might be adding a comment to a blog post. Because there are no associations between comments, it is valid to have two comments being added to the same blog post at the same time.
Wanting to save the cost of loading and saving the full document is a warning sign. That usually points to a problem in the way you are structuring your documents. We'll discuss this further in the Modeling Chapter. For now, it is important to note that regardless of how you wish to modify a document (full update or patching), on the server side, the effect is always replacing the whole object.
With the cautionary words out of the way, the main advantages of patching are that we can handle concurrency in a more granular fashion and that we generally send (a lot) less data over the wire. Usually, if we have two concurrent modifications to the same document, one side would get a ConcurrencyException. That is because we don't know which version should win.
With patching, we don't have the new version of the document; we have a description of the change we want to make to the document. And now let us see how we can actually execute partial updates.
The simple patching API supports operations such as:
- Set a property
- Unset (remove) a property
- Add an item to an array
- Insert an item into an array at a specified location
Listing 5.7 shows an example of using the simple patch API to reduce a product's units in stock.
Listing 5.7: Using simple patching to decrement a product's units in stock
using (var session = documentStore.OpenSession())
{
    session.Advanced.Defer(new PatchCommandData
    {
        Key = "products/1",
        Patches = new[]
        {
            new PatchRequest
            {
                Name = "UnitsInStock",
                Type = PatchCommandType.Inc,
                Value = -1 // increment by -1, i.e. decrement
            },
        }
    });
    session.SaveChanges();
}
Additional options that you can set include what to do if the document doesn't exist (we'll run a different set of patch commands, located in the PatchIfMissing property), or we can specify that we'll only change a property value if its current value matches the PrevVal property, and so on.
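For example, here is a sketch of a guarded set using PrevVal; the property shapes mirror Listing 5.7 and are assumed rather than verified against a specific client version:

session.Advanced.Defer(new PatchCommandData
{
    Key = "products/1",
    Patches = new[]
    {
        new PatchRequest
        {
            Name = "UnitsInStock",
            Type = PatchCommandType.Set,
            Value = 10,
            PrevVal = 0 // only applied if the current value is still 0
        },
    }
});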
session.Advanced.Defer
The Advanced.Defer method allows you to register a low level command (such as PatchCommandData) to be carried out when the session's SaveChanges is called.
This allows you to add commands to the same transaction that will occur when SaveChanges happens, along with all the other changes in the session.
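To illustrate, here is a sketch of a deferred patch riding along with a tracked entity change in a single SaveChanges call (the Discontinued property is assumed for the example):

using (var session = documentStore.OpenSession())
{
    var product = session.Load<Product>("products/2");
    product.Discontinued = true; // a normal, tracked change

    session.Advanced.Defer(new PatchCommandData
    {
        Key = "products/1",
        Patches = new[]
        {
            new PatchRequest
            {
                Name = "UnitsInStock",
                Type = PatchCommandType.Inc,
                Value = -1
            },
        }
    });

    // both the tracked change and the deferred patch
    // go out in the same transaction
    session.SaveChanges();
}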
To be perfectly frank, simple patching is hard, and it isn't really nice to use or very flexible. We've added a lot of options to it over the years, but fundamentally it has remained not very friendly. In general, we recommend that you avoid it in favor of Scripted Patching.
Limitations
Because scripted patches are run on the server side, we need to be cautious about their use. RavenDB will flag and kill any script that is obviously abusive (trying to create a stack overflow, an infinite loop, etc.). By default, a script is limited to about 10,000 operations, after which it is killed.
However, especially with large documents or complex scripts, it can take a while for RavenDB to execute a script. RavenDB does quite a lot to optimize script execution, including caching the parsed scripts, but it still has to evaluate the scripts, and that has a non-trivial cost.
5.5.2.1 Parameters
Frequently, you need to customize your script to allow for different options. For example, if we look at Listing 5.8, we can see that we decrement the units in stock by one. However, what would happen if we wanted to decrement the units in stock by 7, or 5?
One wrong way of doing that would be the following:

Script = "this.UnitsInStock -= " + amountToDecrement + ";";
Allow me to count the ways this is wrong. Just like building SQL strings using concatenation, this is wrong. It produces hard to read code, makes it much harder to cache the scripts and introduces the possibility of user input injection.
Instead, just like SQL again, we have a much better option: using parameters. See Listing 5.9.
Listing 5.9: Using scripted patching with parameters
using (var session = documentStore.OpenSession())
{
    session.Advanced.Defer(new ScriptedPatchCommandData
    {
        Key = "products/1",
        Patch = new ScriptedPatchRequest
        {
            Script = "this.UnitsInStock -= amountToDecrement;",
            Values = { { "amountToDecrement", 7 } }
        }
    });
    session.SaveChanges();
}
You can see that we pass a variable, amountToDecrement, to the script. The advantage of a variable is that you don't have to worry about input injection, you don't have to build the script by string concatenation and you can fully cache the parsed script and reuse it many times.
5.5.2.2 Accessing other documents
One of the really nice things about the scripted patching API is that it not only gives you full access to the current document (exposed as the this object), but it also gives you full access to other documents. You can call LoadDocument to load another document and use its values to modify your document, as you can see in Listing 5.10. Or you can call DeleteDocument to remove a document, or even call PutDocument to create or update another document.
Listing 5.10: Loading a related document during patching
using (var session = documentStore.OpenSession())
{
    session.Advanced.Defer(new ScriptedPatchCommandData
    {
        Key = "products/1",
        Patch = new ScriptedPatchRequest
        {
            Script = @"
var category = LoadDocument(this.Category);
this.CategoryName = category.Name;
",
        }
    });
    session.SaveChanges();
}
You can read the full details about the DeleteDocument and PutDocument
methods in the online documentation, since they are far less used than
LoadDocument.
5.5.3 Concurrency
What will happen if you have two patch requests to the same document at the same time? Because they aren't full document updates, it is actually possible to make this work. What will happen is that the RavenDB engine will serialize all those patch requests to the document, then execute them one at a time. Since each represents a change to the document rather than a replacement of it, this is safe to do, and one of the main use cases for patching is to concurrently update documents. That said, note that if you have a lot of
patch requests to the same document, eventually the internal queue RavenDB uses will fill up and patch requests for this document will be rejected. If there are other operations in the same transaction, they will also fail, as a single unit.
And with that, let us zoom out from partial document updates to a far bigger scope: how to apply application wide behaviors using RavenDB listeners.
5.6 Listeners
It is pretty common to want to run some code whenever something happens in RavenDB. The classic example is when you want to store some audit information about who modified a document. In the previous section, we saw that we can do that manually, but that is both tedious and prone to errors or omissions. It would be much better if we could do it in a single place.
That is why the RavenDB Client API has the notion of listeners. Listeners allow you to define, in a single place, additional behavior that RavenDB will execute at particular points in time. RavenDB has the following listeners:
public class PreventActiveUserDeleteListener :
    IDocumentDeleteListener
{
    public void BeforeDelete(string key,
        object entityInstance, RavenJObject metadata)
    {
        var user = entityInstance as User;
        if (user == null)
            return;
        if (user.IsActive)
            throw new InvalidOperationException(
                "Cannot delete active user: " +
                user.Name);
    }
}

public class OnlyActiveUsersQueryListener :
    IDocumentQueryListener
{
    public void BeforeQueryExecuted(
        IDocumentQueryCustomization queryCustomization)
    {
        var userQuery = queryCustomization as
            IDocumentQuery<User>;
        if (userQuery == null)
            return;
        userQuery.AndAlso().WhereEquals("IsActive", true);
    }
}
In the PreventActiveUserDeleteListener case, we throw if an active user is being deleted. This is very straightforward and easy to follow. It is the case of OnlyActiveUsersQueryListener that is interesting. Here we check if we are querying on users (by checking whether the query to customize is an instance of IDocumentQuery<User>) and if it is, we also add a filter on active users only. In this manner, we can ensure that all user queries will operate only on active users.
We register the listeners on the document store during the initialization.
Listing 5.12 shows the updated CreateDocumentStore method on the
DocumentStoreHolder class.
Listing 5.12: Registering listeners in the document store
private static IDocumentStore CreateDocumentStore()
{
    var documentStore = new DocumentStore
    {
        Url = "http://localhost:8080",
        DefaultDatabase = "Northwind",
    };
    documentStore.RegisterListener(
        new AuditStoreListener());
    documentStore.RegisterListener(
        new PreventActiveUserDeleteListener());
    documentStore.RegisterListener(
        new OnlyActiveUsersQueryListener());
    documentStore.Initialize();
    return documentStore;
}
Once registered, the listeners are active and will be called whenever their respective actions occur.
The IDocumentConversionListener gives you fine-grained control over the conversion process of entities to documents and vice versa. If you need to pull data from an additional system when a document is loaded, this is usually the place where you'll put it (that said, pulling data from secondary sources on document load is frowned upon; documents should be coherent and independent, and needing additional data is usually a performance problem).
A far more common scenario for a conversion listener is to handle versioning, whereby you modify the old version of the document to match an updated entity definition on the fly. This is a way for you to do rolling migrations, without an expensive stop-the-world step along the way.
While the document conversion listener is a great aid in controlling the conversion process, if all you care about is the actual serialization, without the need to run your own logic, it is probably best to go directly to the serializer and use that.
var parts = reader.ReadAsString().Split();
        return new Money
        {
            Amount = decimal.Parse(parts[0]),
            Currency = parts[1]
        };
    }

    public override bool CanConvert(Type objectType)
    {
        return objectType == typeof(Money);
    }
}

public class Money
{
    public string Currency { get; set; }
    public decimal Amount { get; set; }
}
The idea in Listing 5.13 is to have a Money object that holds both the amount and the currency, but to serialize it to JSON as a single string property. So a Money object representing 10 US Dollars would be serialized to the string "10 USD".
The JsonMoneyConverter converts to and from the string representation, and the json serializer customization event registers the converter with the serializer.
Note that this is probably not a good idea; you will usually want to store the Money without modifications, so you can do things like sum up orders by currency, or actually work with the data.
I would only consider using this approach as an intermediary step, probably as part of a migration, if I had two versions of the application working concurrently on the same database.
Polling is wasteful; most of the time you spend a lot of time asking the same question and expecting to get the same answer. Anyone who has had to deal with "are we there yet?" and "are we there yet, now?" knows how annoying that can be. Setting that aside, we have to consider load and latency factors as well. In short, we don't want to do polling. So it is good that we don't have to. Listing 5.14 shows us how; note that this code uses the Reactive Extensions package (Install-Package Rx-Core using NuGet).
Listing 5.14: Registering for Changes() notifications
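A minimal sketch of such a registration (assuming the client API of this era; the document id matches the discussion below, and the last line disposes the subscription):

IDisposable subscription = documentStore.Changes()
    .ForDocument("products/2")
    .Subscribe(notification =>
    {
        // the notification carries the document id and the
        // type of the operation (Put, Delete, ...)
        Console.WriteLine("{0} on {1}",
            notification.Type, notification.Id);
    });

// ... later, when we are no longer interested:
subscription.Dispose();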
Using the code in Listing 5.14, we registered for all changes (Put, Delete) to the products/2 document. We can then take action, such as notifying the user (using SignalR if we are running in a web application, for example). You can register for notifications on specific documents, on all documents with a specific prefix or of a specific collection, on all document changes, or on updates to indexes.
Notice that the change notification includes the document (or index) id and the type of the operation performed, most often Put or Delete in the case of documents. If you want to actually access the document in question, you'll need to load it using a session.
Another important issue when dealing with notifications: once subscribed, you'll continue to get notifications for the subscription until you have disposed of it (the last line in Listing 5.14). And obviously, your subscription is going to be called on another thread, so you need to be aware that you might need to marshal your action to the appropriate location.
And this is pretty much it for Changes(). It is a powerful feature, and it enables a whole host of interesting scenarios. But from an external point of view, it is also drop dead simple to work with and use. And with that in mind, let us look at another such feature: result transformers and what you can do with them.
public class JustOrderIdAndcompanyName :
    AbstractTransformerCreationTask<Order>
{
    public class Result
    {
        public string Id { get; set; }
        public string CompanyName { get; set; }
        public DateTime OrderedAt { get; set; }
    }

    public JustOrderIdAndcompanyName()
    {
        TransformResults = orders =>
            from order in orders
            let company = LoadDocument<Company>(order.Company)
            select new { order.Id, CompanyName = company.Name, order.OrderedAt };
    }
}

// usage
var order = session.Load<JustOrderIdAndcompanyName, JustOrderIdAndcompanyName.Result>("orders/1");
Console.WriteLine("{0}\t{1}\t{2}", order.Id, order.CompanyName, order.OrderedAt);
Notice what we are doing in the JustOrderIdAndcompanyName transformer.
We are using the LoadDocument method inside the transformer to load the
associated company document, then we project just the company name out to
the client.
This feature gives us complete control over how to shape the data from the server. And being able to pull data from associated documents is very powerful. Just to complete this discussion, the over the wire cost here is less than 300 bytes. In real world situations, the actual savings are far more impressive.
You aren't limited to loading just one document, either. We could have gotten the employee's name for this order in much the same way we got the company's name.
This is actually just a taste of what you can do with transformers. We'll run into them again when we discuss indexing, which is where result transformers really shine.
5.10 Summary
We covered a lot of ground in this chapter. We started by talking about lazy loading and how we can use it to reduce the number of remote calls we are making, then diverged into talking about Safe by Default and how we have a budget that limits the number of remote calls we can make. Hence, the numerous ways that the RavenDB Client API gives us to reduce the number of remote calls you make (and along the way, improve your application's performance).
Next, and along the same lines, we covered another Safe by Default topic: preventing Unbounded Result Sets. RavenDB does that by specifying a limit (of 128 results) on the number of results a query will return if you don't specify such a limit yourself, and by enforcing a maximum upper limit (of 1,024 results, by default) for the total number of results that can be requested.
But whenever there is a limit, there is also a way to avoid it. And no, that isn't by using .Take(int.MaxValue). You can get the entire result set, regardless of size, when you use the Streaming API. This API is meant to deal with large numbers of results, and it will stream results on both client and server so you can parallelize the work.
The other side of getting a very large result set from the database is inserting a lot of data into the database. We looked at doing that using SaveChanges and then using the dedicated API for it, BulkInsert. From the very big, we moved to the very small, looking at how we can do partial document updates using patching. We looked at simple patching and scripted patching, what we can do with them and how they work.
And from updating just part of a document, we moved to handling cross cutting concerns via listeners. More specifically, how you can use listeners to provide application wide behavior from a single location. For example, handling auditing metadata in a listener means that you never have to worry about forgetting to write the audit code. We also learned about the serialization process and how you can have fine-grained control over everything that goes on during serialization and deserialization.
Closing this chapter, we looked at the Changes() API, which allows us to register for notifications from RavenDB whenever a document changes (as well as a host of other stuff), and at result transformers, which allow us to run a server side transformation of the data before it is sent to the client.
Wow, that was a lot of stuff to go through. This concludes Part I, which gives you the basic tools of how to use RavenDB. Next, we are going to start talking about indexing, and all the exciting things that you can do with it. Go get some coffee: you're going to want to be awake for what is coming.
Part II
In this part, we'll learn about indexing and querying:
- A deep dive into RavenDB's indexing implementation
- Ad-hoc queries, automatic indexes and the query optimizer
- Why do we need indexes?
- Dynamic & static indexes
- Simple (map only) indexes
- Full text searching, highlights and suggestions
- Multi map indexes
- Map/reduce indexes
- Spatial queries
- Facets and dynamic aggregation
- Advanced querying options
Chapter 6
Chocolade
Aniseed Syrup
Côte de Blaye
Escargots de Bourgogne
Camembert Pierrot
Filo Mix
Carnarvon Tigers
Flotemysost
Chai
Geitost
Chang
Genen Shouyu
Chartreuse verte
Gorgonzola Telino
So how can this be an index? The answer is that this isn't actually the index. The LINQ expression above is actually the index definition; it determines what will be indexed (as well as exactly how, but we'll touch on that later). How does that work?
Let us look at Listing 6.1. This is the external representation of the index, but internally, we also need to track where the details came from, so the end result is:
from product in docs.Products
select new { product.Name, product.__document_id }
The output of the index definition is a list of objects with a Name and a __document_id property. But what can we do with this?
The following isn't actually how this works in RavenDB; we'll get to the full details of that in a bit. This is an attempt to explain how RavenDB works by simplifying things as much as possible.
A database can contain a lot of documents, so it isn't practical to run the indexing function over the entire data set every time. Instead, we do incremental indexing, and we do that using the etag. Listing 6.4 shows a simplified version of how indexing works.
Listing 6.4: Highly simplified indexing process
while (databaseIsRunning) {
    var lastDocEtag = GetLatestEtagForAllDocuments();
    var lastIndexEtag = GetLastIndexedEtagFor("Products/ByName");
    if (lastDocEtag == lastIndexEtag) {
        // index is up to date
        WaitForDocumentsToChange();
        continue;
    }
    var docsToIndex = LoadDocumentsAfter(lastIndexEtag)
        .Take(autoTuner.BatchSize);
    foreach (var indexEntry in indexingFunc(docsToIndex)) {
        StoreInIndex(indexEntry);
    }
    SetLastIndexEtag("Products/ByName", docsToIndex.Last().Etag);
}
I'll repeat again that Listing 6.4 shows the conceptual model; the actual workings are very different. But the overall process is similar in intention, if not in practice.
Indexing works by pulling a batch of documents from the document storage, applying the indexing function to them, and then writing the index entries to the index. We update the last indexed etag, and repeat the whole process again. When we run out of documents to index, we wait until a document is updated or changed, and the whole process starts anew.
This means that when we index a batch of documents, we just need to update the index with their changes; there is no need to do all the work from scratch. You can also see in the code that we process the documents in batches (which are auto-tuned for best performance). Even though the code in Listing 6.4 is far away from how it actually works, it should give you a good idea of the overall process.
It is just too costly to do so. In order to keep its promise, the relational engine has to take locks, and do a lot of extra work to isolate different transactions from each other. The more transactions you have, the higher the cost. Until, at a certain point, the relational database is overloaded; it will throw its hands up in the air and go sit in the corner while it is having a funk.
Your application, in the meantime, will start erroring (if you are lucky) or just hang, waiting for the relational database to respond.
With RavenDB, you get a different kind of promise. RavenDB will promise to:
- Give you immediate results based on what we currently have in the index.
- Tell you if the current state of the index isn't up to date, including how up to date the index is.
- Do its best to reduce the indexing latency.
- Give you the option to explicitly wait for the results to become non stale.
Documents are always consistent
RavenDB uses the terms stale and non-stale to refer to out of date indexes. This is done intentionally, because consistency is always maintained.
Listing 6.3 showed how queries work, from a logical perspective. We first consult the index to find a match for the query we are executing. Once we have the match, we have the relevant document id. Using that document id, we go to the document storage and load that document by id.
RavenDB's document storage subsystem is fully consistent, so you are always getting the latest committed version of the document.
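In the spirit of the book's simplified listings, that logical flow might be sketched like this (the helper names are made up):

IEnumerable<dynamic> Query(string indexName, string query)
{
    // step 1: the index yields the matching document ids (possibly stale)
    foreach (var documentId in SearchIndexForMatches(indexName, query))
    {
        // step 2: the documents themselves always come from the fully
        // consistent document storage, latest committed version
        yield return LoadDocumentFromStorage(documentId);
    }
}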
Why is it important that RavenDB makes a different set of promises than a relational database? Because by making a different set of promises, we have opened ourselves up to a great deal of optimization opportunities. Here are just a few of those that are implemented in RavenDB.
6.5 Lucene
I mentioned earlier that we are using Lucene to store our indexes. But what is it? Lucene is a library for creating an inverted index. It is mostly used for full text searching and is the backbone of many search systems that you routinely use. For example, Twitter and Facebook use Lucene, and so does pretty much everyone else. It has gotten to the point that other products in the same area always compare themselves to Lucene.
Now, Lucene has a pretty bad reputation (it is fairly involved to run, from an operations point of view), but it is the de facto industry standard for searching. So it isn't surprising that RavenDB uses it, and does quite well with it. We'll get to the details of how to use Lucene's capabilities in RavenDB in the next chapter; for now I would like to talk about how we are actually using Lucene in RavenDB.
I mentioned that successfully running Lucene in production is somewhat of a hassle for operations. This has to do with several reasons:
- Lucene needs to occasionally compact its files (a process called merging). Controlling how and when this is done is key to achieving good performance when you have a lot of indexing activity.
- Lucene doesn't do any sort of verifiable writes. If the machine crashes midway through, you are open to index corruption.
- Lucene doesn't have any facility for an online backup process.
- Optimal indexing and querying speeds depend a lot on the options you use and the exact process in which you work.
(I'll also take this opportunity to thank Tobias Grimm, who was a great help in finding those kinds of issues.)
System.StringComparison.InvariantCultureIgnoreCase)
    select new {
        user.Name,
        user.Email,
        __document_id = user.__document_id
    });

this.AddField("__document_id");
this.AddField("Name");
this.AddField("Email");
this.AddQueryParameterForMap("__document_id");
this.AddQueryParameterForMap("Name");
this.AddQueryParameterForMap("Email");
this.AddQueryParameterForReduce("__document_id");
this.AddQueryParameterForReduce("Name");
this.AddQueryParameterForReduce("Email");
}
}
All indexes inherit from the AbstractViewGenerator class, and the actual indexing work is done in the lambda passed to the AddMapDefinition call. You can see how we changed the index definition. The docs.Users call, which looks like a collection reference, was changed to the more accurate where statement, which filters unwanted items from different collections. You can also see that in addition to the properties that we have for indexing, we also include the __document_id property.
Note that we keep the original index definition in the ViewText property (mostly for debug purposes), and that we keep track of the entity names each index covers. The latter is very important for optimizations, since we can make decisions based on this information (such as which documents we do not need to send to this index).
The RavenDB Indexing Language
On the surface, it looks like the indexing language RavenDB is using is C# LINQ expressions. And that is true, up to a point. In practice, we have taken the C# language prisoner and made it jump through many hoops to make our indexing story easy and seamless.
The result isn't actually plain C#. For example, there are no nulls. Instead, we use the Null Object pattern to avoid dealing with NullReferenceExceptions. Another change is that the language we use isn't strongly typed, and won't error on missing members.
All of that said, you can actually debug a RavenDB index in Visual Studio, because however much we twist the language's arm, we end up compiling to C#.
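To make the difference concrete, here is a minimal sketch of an index that would throw a NullReferenceException in plain C# but is safe inside a RavenDB index. The User and Address classes here are hypothetical, invented for this illustration:

using System.Linq;
using Raven.Client.Indexes;

public class Address
{
    public string City { get; set; }
}

public class User
{
    public string Name { get; set; }
    public Address Address { get; set; }
}

public class Users_ByCity : AbstractIndexCreationTask<User>
{
    public Users_ByCity()
    {
        Map = users =>
            from user in users
            select new
            {
                // Even for documents where Address is null, this does not
                // throw during indexing; the null is replaced by a null
                // object, and the missing member simply yields null.
                City = user.Address.City
            };
    }
}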
6.10 Summary
We've gone over the details of how RavenDB indexing actually works, hopefully not in mind-numbing detail. Those details are not important during your work with RavenDB, since all of that is very well hidden under the covers, but it is good to know how RavenDB will respond to changing conditions.
We started by talking about the logical view of indexing in RavenDB: how an indexing function outputs index entries that will be stored in a Lucene index, and how queries will go against that index to find the matching document ids, which will then be pulled from document storage. We then talked about incremental indexing and the conceptual process by which RavenDB actually indexes documents.
From the conceptual level, we moved to the actual implementation details, including the set of trade-offs that we have to make in indexing between I/O, CPU and memory usage. We looked at how we deal with each of those issues: optimizing I/O by prefetching documents and batching writes, optimizing memory by auto-tuning the batch size, and optimizing CPU usage by parallelizing work (but not too much).
We also talked about what actually gets indexed, and how we optimize things so an index doesn't have to go through all documents, only those relevant to it. Then we talked about the new index creation strategies, and how we try to make sure that they are as efficient as possible while still letting the system operate normally.
We got to talking a bit about Lucene, and how we actually manage the index, safeguard it from corruption and handle recovery. In particular, we do so by managing the state of the index outside of Lucene, and by checking the recovered state in case of a crash.
We concluded the chapter by talking about the actual code that gets run as part of the index, error handling and recovery during indexing, and the details of index priorities and why they are set up the way they are.
I hope that this peek behind the curtain doesn't make you lose any faith in the magical properties of RavenDB; "pay no attention to the man behind the screen", as the Wizard said. Even after knowing how everything works, it still seems magical to me. And one of the most magical features in RavenDB is the topic of the next chapter: how RavenDB allows ad-hoc queries by using automatic indexing and the query optimizer.
Chapter 7
Figure 7.1: The RavenDB query optimizer likes to chase down queries and send
them to the right indexes.
So the query optimizer decided to create such an index. And hence, an index is born. The proud parent watched over the new index, ensuring that it did its job properly, and finally released it to the wild, to roam free and answer any queries that would be directed its way.
Ad hoc queries weren't supposed to be there
When RavenDB was just starting (late 2010), we already had a user base and a really cool database. What we didn't have was ad hoc queries. If you wanted to query the database, you had to write an index to do so, and then you had to explicitly specify which index would be used in each query. That was a hassle, but there was really no way around it. We didn't want to do anything that would force a table scan, and there was no other way to support this feature.
Then Rob Ashton popped into the mailing list and started sending us crazy bug reports with very complex map/reduce indexes. And he started making silly proposals about dynamic queries, stuff that would obviously not work.
The end result was that I asked him to show me some code, with the full expectation that I would never hear from him again.
He came back a day later with a functional proof of concept. After I managed to pick my jaw off the floor, I was able to figure out what he was doing and got very excited.
Once we had the initial idea, we basically took it and ran with it. And the result is a very successful feature, and this chapter, too.
Leaving aside the anthropomorphism of the query optimizer, what is going on is that the query optimizer reads the query and tries to match a relevant index to it. If it can't find a relevant index, it will create the index that can answer this query. It will then start executing the query against the index. Because indexing can take some time, it will wait until the query has enough results to fill a single page, or 15 seconds have passed (or indexing has completed, of course), before it returns the results to the user.
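For example, a dynamic query as simple as the following (a sketch, assuming the Northwind Product class) is all it takes; if no index covers Category, the query optimizer will create a matching Auto/ index behind the scenes:

var products = session.Query<Product>()
    .Where(p => p.Category == "categories/2")
    .Take(10) // a single page of results
    .ToList();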
You can ask the query optimizer what it was thinking when it gave a particular index the chance to run a particular query. You can do that using the following REST call: GET /databases/Northwind/indexes/dynamic/Products?query=Category:categories/2&explain=true

- /databases/Northwind - the database we use
- /indexes/dynamic/Products - a dynamic query on the Products collection
- ?query=Category:categories/2 - querying all products in a particular category
- &explain=true - explain what you were thinking

This can be helpful if you want to understand a particular decision, although most of the time, those are self-evident.
When the Auto/Products/ByCategoryAndDiscontinuedAndSupplier index is up to date, we will start using only that index, and no queries will go to Auto/Products/ByCategoryAndDiscontinued. At that point, the self-cleaning features of the query optimizer come into play; because this index is no longer in use, it will be demoted to idle and then deleted or abandoned, depending on its age.
The query optimizer does quite a bit of work, but we have only seen part of it. We looked at how it manages the indexes; now let us look at the kind of
from order in session.Query<Order>()
from line in order.Lines
where line.Discount >= 5
select order;

Vs.

from order in session.Query<Order>()
where order.Lines.Any(l => l.Discount >= 5)
select order;
The first query and the second are conceptually the same, but the second one uses an Any method instead of the multiple from clauses (or SelectMany, if you are using the method syntax). Conceptually the same, in the sense that this will result in the same filtering going on. But the outputs of those queries are very different.
In the first query, you'll get the order back as many times as there are lines in the order, while in the second query you'll get the order exactly once. That, and the exploding complexity of trying to parse arbitrary LINQ queries, has caused us to limit ourselves to the simpler syntax.
Group by clauses, let clauses, multiple from clauses, and join clauses in LINQ are also not supported for queries. They don't make a lot of sense for a document database, and while we have better alternatives for those, the syntax exposed by LINQ doesn't make it possible to expose them easily.
Ordering, however, is fully supported, as you can see in the following query:
from order in session.Query<Order>()
where order.Lines.Any(l => l.Discount >= 5)
orderby order.ShipTo.City descending
select order;
It is important to remember that in RavenDB, querying is done on each document individually; it isn't possible to query a document based on another document's properties. Well, to be more exact, that is certainly possible, but it isn't done via dynamic queries. We'll touch on that as well in the next chapter.
7.2.4 Projections
Consider the following query: what would be its output?

from order in session.Query<Order>()
where order.Employee == "employees/1"
select order;
Well, it is pretty obvious that we are going to get the full order document back. And this is great, if we wanted the full document. But a lot of the time, we just want a few properties back from the query results, just enough to show them on a grid. How would that work? About as simply as you can imagine, see:
foreach (var order in orders)
{
    // The third field in this line was cut off in the original listing;
    // OrderedAt is an assumption.
    Console.WriteLine("{0}\t{1}\t{2}", order.Id, order.CompanyName, order.OrderedAt);
}
Using this method, we are able to pick just the data that we want, and only send the relevant details to the client. There are a few important aspects to note here. The TransformWith method call takes a query and operates over the results of that query. Note that even actions that appear to happen later (like the Take(10) call) will be applied before the transformer is run.
In other words, on the server side, we'll only get 10 orders to work through inside the transformer. That means that we'll only need to run the transformation (and load the associated company) 10 times.
Another result of this decision is that by the time the transformer is called, all the paging and sorting has already been done, so while it can apply its own filtering and sorting, the data it receives has already been processed.
It is very common to use transformers to get the data to show whenever you have a grid or a table of some sort. You can pick just the data you want (even dynamically), and you can merge data from multiple documents, if that makes sense.
All of that assumes that you have relatively large documents, and that you only want to show a few of their properties. If that doesn't hold true, and you actually need the whole thing, just load the entire document.
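To make the preceding discussion concrete, here is a sketch of what such a transformer and its usage might look like. The class and property names here are assumptions for illustration, not the book's own listing:

using System.Linq;
using Raven.Client.Indexes;

public class Orders_Header : AbstractTransformerCreationTask<Order>
{
    public class Result
    {
        public string Id { get; set; }
        public string CompanyName { get; set; }
    }

    public Orders_Header()
    {
        TransformResults = orders =>
            from order in orders
            // Load the associated company once per order, on the server.
            let company = LoadDocument<Company>(order.Company)
            select new
            {
                order.Id,
                CompanyName = company.Name
            };
    }
}

// Usage - the Take(10) is applied before the transformer runs, so only
// ten orders are transformed and sent over the wire:
var orders = session.Query<Order>()
    .TransformWith<Orders_Header, Orders_Header.Result>()
    .Take(10)
    .ToList();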
7.4 Summary
In this chapter, we started by taking on the role of the query optimizer and seeing all the things that it is doing: from managing queries, to generating indexes on the fly, to removing indexes on the fly, and in general taking care of our indexes.
The query optimizer takes care to do this in a way that results in an optimal system, by merging common indexes and removing old and unused ones. After going over the query optimizer details, we got down to business and looked at how dynamic queries actually work.
We queried simple properties, and then values stored inside collections; then we looked at the types of queries that can be made using dynamic queries. And that covers quite a lot of ground. We can do equality and range comparisons, compare a value to a list to find if it is in it, or create complex queries by combining multiple clauses.
Following queries, we moved to what we can do with the data they give us, and we looked into using Include in queries. This is where they really shine, since you can drastically reduce the number of remote calls that you have to make.
Even that wasn't enough for us, so we looked into reducing the amount of data we send over the wire by using projections, and when that isn't enough, we have transformers in our toolbox as well.
Using transformers, we can pull data from multiple documents, if needed, and get just the data that we need. This gives us a way to fine-tune exactly what we get, reducing both the number of remote calls and the size on the wire.
Finally, we looked at how queries are actually implemented. We inspected DocumentQuery and saw that we can use it for fully dynamic queries, without requiring any sort of predefined type, and that this is what the RavenDB LINQ API is actually using under the covers. We looked at the result of such queries, using the Lucene syntax, and then explored how the query optimizer uses a specific property path syntax to know how to get to the data.
Coming next, we'll start learning about static indexes in RavenDB, which is quite exciting, since this is where a lot of the really cool stuff happens.
Chapter 8
Static indexes
In the previous chapter, we talked about dynamic indexes and the query optimizer. For the most part, those serve very well to free you from the need to manually deal with indexing. Pretty much all of the standard queries can be safely handed over to the query optimizer, and it will take care of them.
That leaves us to deal with all of the non-standard queries. Those can be full text search, spatial queries, querying on dynamic fields, or anything of real complexity. That isn't to say that defining static indexes in RavenDB is complex, only that they are required when the query optimizer can't really guess what you wanted.
By the end of this chapter, you'll know how to create a static index in the studio and in your code, how to version indexes in your source control system, and how to make RavenDB sit down, beg and (most importantly) go and fetch the right documents for you.
This kind of index isn't really useful, of course. We are usually better off letting the query optimizer handle such simple indexes, rather than writing such trivial indexes ourselves. We demonstrate with such an index because it allows us to talk about the mechanics of working with indexes without going into any extra details.
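For reference, the trivial index under discussion (the listing appears just before this point in the original) is along these lines - a sketch, assuming the Northwind Employee class:

using System.Linq;
using Raven.Client.Indexes;

public class Employees_ByFirstName : AbstractIndexCreationTask<Employee>
{
    public Employees_ByFirstName()
    {
        // Index a single, simple property - exactly what the query
        // optimizer would generate on its own.
        Map = employees =>
            from employee in employees
            select new { employee.FirstName };
    }
}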
With RavenDB, you don't have to define a schema, but in many cases, your code will expect certain indexes to be available. How do you manage your indexes? One way of doing so is manually. You can export just the index definitions from the development server and import them to the production server. Another way would be to copy/paste each index definition from one server to the next. Or you can just jot them down using pen & paper and remember to push those changes to production.
Of course, all those options have various issues, not the least of which is that they are manual processes, and that they aren't tied to your code and the version that it is expecting. This is a common problem in any schema management system. You have to take special steps to make sure that everything is in sync, and you need to version and control your changes explicitly.
That is the cause of quite a lot of pain. When giving talks, I used to ask people how many of them ever had a failed deployment because they forgot to run the schema changes. And I got a lot of raised hands, every single time. With RavenDB, we didn't want to have the same issues. That is why we have the option of managing indexes in your code.
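A common pattern, sketched below, is to scan an assembly for index classes at application startup. Here, documentStore is assumed to be an initialized IDocumentStore, and MyFirstIndex stands in for any index class in your project:

using Raven.Client.Indexes;

// Finds every AbstractIndexCreationTask subclass in the assembly and
// creates or updates the matching indexes on the server.
IndexCreation.CreateIndexes(typeof(MyFirstIndex).Assembly, documentStore);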
Now that we know how to work with indexes in our code, we need to upgrade from writing trivial indexes to writing useful stuff.
By moving all computation from query time (expensive, often requiring a run through the entire data set on every query) to indexing time (cheap, happening only when a document changes), we are able to change the costs associated with queries. Even complex queries always end up being some sort of a search on the index. That is one of the reasons why we usually don't have to deal with slow queries in RavenDB. There isn't anything that would cause them to be slow.
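As a sketch of the idea behind the Orders_Totals index discussed below (Listing 8.2 in the original; the order line property names are assumptions based on the Northwind sample data):

using System.Linq;
using Raven.Client.Indexes;

public class Orders_Totals : AbstractIndexCreationTask<Order>
{
    public class Result
    {
        public decimal Total { get; set; }
    }

    public Orders_Totals()
    {
        Map = orders =>
            from order in orders
            select new
            {
                // The total is computed once, at indexing time, rather
                // than on every query.
                Total = order.Lines.Sum(l => l.Quantity * l.PricePerUnit)
            };
    }
}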
There are four distinct models involved in indexing:

- The documents to be indexed.
- The index entry is the output from the indexing function.
- The query model is how we can query on the index.
- The query result is what is actually returned from the index.
This is a bit confusing, but it can be quite powerful. Let us see why we have so many models, first.
The difference between documents and index entries is obvious. We can see the difference quite clearly in Listing 8.2. The document doesn't have a Total field; the index is computing that value and outputting that field to the index entry. Thus, the index entry for orders/1 has a Total field with a value of 440.
Because we have this difference between the document that was indexed and the actual index entry, we also have a difference between how we query and what is actually returned. Look at Listing 8.3. We start the query using: Query<Orders_Totals.Result, Orders_Totals>().
The first generic parameter, Orders_Totals.Result, is the query model, and the second is the index to use. The query model is usually also the query result, since most of the time they are the same thing. In this case, however, we need to query on the Total field, which does not exist on the document.
As we discussed in Chapter 6, the way queries work in RavenDB is that we first run the query against the index, which gives us the matching index entries. We then take the __document_id property from each of those matching index entries and use that to load the relevant documents from the document store.
This is exactly what is going on in Listing 8.3. We start by using the query model Orders_Totals.Result, but what we are actually getting back from the server is the list of matching orders. Because of that, we need to explicitly change the result type back to the document type, using OfType<Order>().
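Putting it together, the query in Listing 8.3 is along these lines (a sketch; the exact predicate in the book's listing may differ):

var bigOrders = session.Query<Orders_Totals.Result, Orders_Totals>()
    .Where(x => x.Total > 100)   // query on the index-entry model
    .OfType<Order>()             // but get full Order documents back
    .ToList();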
We aren't treating this as a collection, but just as a simple value. This seems silly. Why go to all this trouble and introduce a whole new model when we could have just called Contains() and called it a day?
This behavior can be very helpful when you have a more complex index, such as the one in Listing 8.5.
Listing 8.5: Index for searching employees by name or territory
public class Employees_Search :
    AbstractIndexCreationTask<Employee, Employees_Search.Result>
{
    public class Result
    {
        public string Query;
    }

    public Employees_Search()
    {
        Map = employees =>
            from employee in employees
            select new
            {
                Query = new object[]
                {
                    employee.FirstName,
                    employee.LastName,
                    employee.Territories
                }
            };
    }
}
In Listing 8.5, the output of the index for employees/1 would be:

Query: ["Davolio", "Nancy", ["06897", "19713"]]
Note that we don't have a simple array, but an array that contains strings and an array of strings. We can't just call Contains() on the Query field. But because RavenDB will flatten out collections, we can query this index using the following code, and we'll get the employees/1 document:
session.Query<Employees_Search.Result, Employees_Search>()
    .Where(x => x.Query == "06897")
    .OfType<Employee>()
    .ToList();
This ability can be extremely useful whenever we want to use an index for searching. We'll cover this in a lot more depth in the next chapter.
So far, we have looked at simple indexes: indexes that have only a single map function in the index definition, and that operate on a single collection. But RavenDB actually allows us to do much more, using multi map indexes.
8.5 Projections
Projections are a way to collect several fields from a document, instead of working with the whole document. In the case we have now, we already know the shape of the data that we want to deal with. It is the same common shape already defined in the index. The multi map index enforces that all map functions have the same output, and we can use that when the time comes to query the index. In the case of the People/Search index, that means that we are going to be showing a list of names, the type of each document and its id. Listing 8.7 shows the changes required to make this work.
Listing 8.7: Multi map index that allows projections
public class People_Search :
    AbstractMultiMapIndexCreationTask<People_Search.Result>
{
    public class Result
    {
        public string Type { get; set; }
        public string Id { get; set; }
        public string Name { get; set; }
    }

    public People_Search()
    {
        AddMap<Company>(companies =>
            from company in companies
            select new
            {
                company.Id,
                company.Contact.Name,
                Type = "Company"
            });
        AddMap<Employee>(employees =>
            from employee in employees
            select new
            {
                employee.Id,
                Name = employee.FirstName + " " + employee.LastName,
                Type = "Employee"
            });
        AddMap<Supplier>(suppliers =>
            from supplier in suppliers
            select new
            {
                // The tail of this listing was lost at a page break in the
                // source; the supplier map and the Store() calls referenced
                // below are reconstructed following the pattern of the
                // other maps.
                supplier.Id,
                Name = supplier.Contact.Name,
                Type = "Supplier"
            });

        Store(x => x.Name, FieldStorage.Yes);
        Store(x => x.Type, FieldStorage.Yes);
    }
}
The first thing to note is that we now define an inner class called Result, and reference it as the generic argument. The generic argument doesn't have to be an inner class, but that is a common convention, because it ties the index and its expected result together. We have seen this used before for queries, but now we are going to use it for getting the relevant data:
var results = session.Query<People_Search.Result, People_Search>()
    .Where(x => x.Name == "Michael")
    .ProjectFromIndexFieldsInto<People_Search.Result>()
    .ToList();
What is going on in this piece of code? First, we tell the RavenDB Linq provider that we'll be querying the People/Search index, and that we'll be using the People_Search.Result class as the query model. Then we actually specify the query, and finally, we ask RavenDB to project the results from the index. What does this mean?
Look at the last lines of Listing 8.7; we have a few Store() calls there. Usually, the index doesn't bother to keep around any data beyond what it needs to actually answer a query. But because we told it to store the information, it won't only index the data, but will also allow us to retrieve the information directly from the index.
Usually, queries in RavenDB have the following workflow: query the index for matching index entries, take the document ids from those entries, load the documents from the document store, and return them to the client.
When we use projections, the workflow is different. Instead of getting just the document ids from the index, we'll also get the projected fields. That is why we have to store them in the index. It means that we'll query the index, load the results directly from it, and immediately return them to the client.
And in this case, we'll return a list of (Id, Name, Type) to the client, and the client can show it to the user, who will perform any additional actions on it.
What happens if we project a non-stored field?
If you project on a field that isn't stored, RavenDB will load the document from the document store, and then get the field from the document directly. You get the correct result, but you incur the cost of actually loading the document from disk, although it won't be sent over the network. For small documents, that is hardly anything major, but for large documents, that might be something you want to pay attention to.
Projections aren't limited to multi map indexes; you can use them in any index, and they are frequently quite useful. They also go hand in hand with transformers, which also allow you to limit the amount of data that you get from the server. A transformer running on an index will first try to find a stored field, and only if it can't find the stored field will it load the document.
Because projections are usually based on the data in the index, they are subject to the same staleness considerations. If you load the data directly from the document (whether by loading the document itself or by projecting from the document), you are ensured that you'll always have the latest version at the time of the query. If you are projecting data stored in the index, it is possible that the data on the document has changed since then.
You don't have to store all the fields in the index; you usually store just the fields that you care to project out. Linq queries such as the following one are also using projections under the covers:
from order in session.Query<Order>()
select new
{
    order.Company,
    order.OrderedAt,
}
This query will project the Company and OrderedAt fields; such queries are the reason why we fall back to the document if the index doesn't already store those fields.
be used carefully. It isn't meant to be the hammer that you'll use to beat every problem into submission. Scared yet?
I do admit that this is quite an introduction for a feature that we haven't even discussed. But we have had a lot of problems with users tripping themselves up over this feature, and most of this has been because they came to it from a relational mindset. So I would ask that you read this section to understand the technical nature of this feature, but refrain from using it until you have read Part III - Modeling and can understand the proper way to design a document-based model with RavenDB.
With this ominous introduction out of the way, let us see what this feature is all about. And it is a really cool feature. LoadDocument allows you to load another document during indexing, and use its data in your index. Let us take a simple example: we want to search for a product by its category. You can see how a product stores the category information in Figure 8.2.
    Map = products =>
        from product in products
        select new
        {
            CategoryName = LoadDocument<Category>(product.Category).Name
        };
    }
}
What is going on in Listing 8.8? We are calling LoadDocument<Category>(product.Category) in the map, loading the relevant category and indexing its name. That means that we can now query for the products in the Beverages category using the following code:
var results = session.Query<Products_Search.Result, Products_Search>()
    .Where(p => p.CategoryName == "Beverages")
    .OfType<Product>()
    .ToList();
As we discussed previously in this chapter, because we have a different model for the shape of the index and the shape of the result, we start the query with Products_Search.Result and then use OfType<Product> to change the Linq result to the appropriate returned type.
Now, what is actually going on here? Let us consider the case of the products/70 document being indexed by the Products/Search index. During indexing, we call the LoadDocument method, which will fetch the categories/1 document and index the category name in the index entry for the products/70 document. Now we can search for CategoryName: Beverages, and get (among others) the products/70 document.
So far, so good, and a really useful feature. But let us get down to the nitty gritty details. Because this happens during indexing, what happens if the relevant category is null, or if the value in that field is the id of a non-existent document? Like any other null handling in RavenDB, this is handled for you, and you don't need to write null checks or guard against it.
A more interesting problem happens when we deal with changes. Consider the previously mentioned products/70 and its associated categories/1. The index entry for products/70 looked like the following:

{ "CategoryName": "Beverages", "__document_id": "products/70" }
What happens when we update the categories/1 document? For example, to change the name from Beverages to Drinks? Documents are only re-indexed when they are changed. So we would expect that, since products/70 didn't change, this index entry won't be updated, and it will reflect the state of the data at the time of the product's indexing.
RavenDB is smart enough to recognize that this would turn the feature into a curiosity, nothing more. Because of that, RavenDB creates an internal association between the two documents. You can think about it as an internal table that associates products/70 with categories/1. Whenever we put a document, we check those associations, and we touch all the referencing documents, which forces their re-indexing.
Who touched my documents?
Touching a document usually happens whenever we detect that a document referenced by this document during indexing has changed, so the document needs to be re-indexed (to pick up the new values from the referenced document). Touching a document involves updating its etag value, which will force it to be re-indexed.
The call of LoadDocument is a tempting one: "Come to the Dark Side, we have cookies." And it is a very powerful feature, but it is something that you should be using only after carefully understanding its implications.
The most common usage of LoadDocument, unfortunately, is when users model documents in RavenDB in the same manner that they did when using a relational database, failing to understand that a very different database engine requires a different approach. We'll discuss this at great length in the modeling part of the book.
8.8 Queries
We have gone over a lot of the details of static indexes, and we are almost done with this topic. Before we go on to talk about full text search and map/reduce, I want to go over the potential query options in RavenDB. This isn't meant to be an exhaustive list (see the documentation for that), but it should give you a good idea about the kind of querying capabilities that RavenDB has.
A lot of this is probably obvious, and you can skim through the rest of this section without losing much.
var results =
    from order in session.Query<Order>()
    where order.Freight >= 25
    select order;
These, too, use the index rather than comparing all the values. Note that we have implemented special behavior for numerics, so we can compare them without incurring "3 is greater than 10" issues when doing lexical comparisons of numbers.
This is translated to the following query: FirstName: Marg*. The same can be done in reverse, by using EndsWith. However, that is not advisable. Using a prefix query, we can make good use of our indexes to efficiently answer the query. The same is not true for EndsWith, which requires us to scan the entire index. A better alternative, if you need to query using EndsWith, is to create an index with the reversed value and use the much more efficient StartsWith on that.
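For reference, the StartsWith query under discussion looks roughly like this (a sketch, assuming the Northwind Employee class):

var results = session.Query<Employee>()
    .Where(x => x.FirstName.StartsWith("Marg"))
    .ToList();
// Translated to the Lucene query: FirstName: Marg*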
We can issue a query for those companies using the following code:
var results =
    from c in session.Query<Company>()
    where c.Address.Country.In("Spain", "Denmark", "Switzerland")
    select c;
And the resulting query would be @in<Address.Country>:(Spain,Denmark,Switzerland). Note that this is using a RavenDB-specific extension to the Lucene query syntax to provide an efficient search mechanism for potentially large lists.
Finally, the last advanced query operation we'll deal with in this chapter is Any, which is used to express nested queries. The question we pose is this: find all orders that have a line item with a specific product.
var results = from order in session.Query<Order>()
              where order.Lines.Any(x => x.ProductName == "Milk")
              select order;
And the generated query is: Lines,ProductName:Milk. Pay attention to the comma operator in the query. This instructs RavenDB that the Lines property is a collection, and that we are nesting into the properties of the items in that collection.
There is one very important detail that you need to be aware of: how multiple predicates are treated in this situation. Let us consider the following Linq query:
var results =
    from order in session.Query<Order>()
    where order.Lines.Any(x => x.ProductName == "Milk" && x.Quantity == 3)
    select order;
It asks for all orders that have a line item with the name Milk and a quantity of 3. However, the generated query would be: Lines,ProductName:Milk AND Lines,Quantity:3. What effectively happens is that, as far as RavenDB is concerned, the query we issued is:
where order.Lines.Any(x => x.ProductName == "Milk") &&
      order.Lines.Any(x => x.Quantity == 3)
In other words, there is no guarantee that the match on ProductName being Milk and the match on Quantity equaling 3 come from the same line item. Why is that? In order to understand the reasoning behind this, we need to look at the auto-generated index for this query:
// Auto/Orders/ByLines_ProductNameAndLines_Quantity
from doc in docs.Orders
select new {
    // The body of this listing was cut off at a page break in the source;
    // the flattened fields presumably look like this:
    Lines_ProductName = doc.Lines.Select(x => x.ProductName),
    Lines_Quantity = doc.Lines.Select(x => x.Quantity)
}
In other words, we are flattening all the line items into a single index entry. This is done to avoid a fanout, a situation where a single document outputs a very large number of index entries. A big fanout can cause RavenDB to consume a lot of memory and I/O during indexing, so we try to avoid it.
Advanced query operations and dynamic vs. static indexes
Usually, there isn't any difference between queries made against a dynamic index and a static index. But as you can see in the Auto/Orders/ByLines_ProductNameAndLines_Quantity index, RavenDB will make specific transformations when making certain queries (specifically, Any queries, using the comma operator). So a query such as Lines,Quantity:3 will be translated to a query on the Lines_Quantity field. If you want to make such queries on your static indexes, you are probably better off matching the expected RavenDB naming.
There are conventions that control this behavior, which you can tune to your own desires, but it is usually much easier to just follow the same behavior as the rest of RavenDB.
You can define your own static index to answer this query, one which has a separate index entry per line item, allowing you to query the quantity and product name of a specific line item. In that case, you can be explicit about the number of items that you want to allow in your fanout. A sketch of such an index follows.
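Here is a minimal sketch of such a fanout index, assuming the Northwind Order and OrderLine classes; because each order line gets its own index entry, both predicates are guaranteed to match the same line:

using System.Linq;
using Raven.Client.Indexes;

public class Orders_ByLine : AbstractIndexCreationTask<Order>
{
    public Orders_ByLine()
    {
        Map = orders =>
            from order in orders
            from line in order.Lines
            // One index entry per line item - an explicit fanout.
            select new
            {
                line.ProductName,
                line.Quantity
            };
    }
}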
8.9 Summary
In previous chapters, we have dealt a lot with the indexing mechanisms: how RavenDB indexes documents, how dynamic queries work, etc. In this chapter, we have explored the various options that we have when creating our indexes, and how they affect the sort of queries that we can make.
We started by looking into how we are actually going to create, maintain and manage indexes in RavenDB. Defining indexes in your code allows you to version them alongside your code, in the same source control system, and using the same tools.
Afterward, we moved to creating complex indexes and moving the cost of computation from query time to indexing time, one of the ways in which we are able to perform super fast queries with RavenDB. We talked about the document model, the indexing model, the query model and the result model, and how using slightly different models allows us to create some pretty impressive results.
Multi map indexes were the next topic at hand, allowing us to index multiple collections into a single index and query them in an easy manner. Following multi map, we discussed projections and the general process of loading the data from an index and storing information in the index.
We discussed the LoadDocument feature, its strengths and its weaknesses. This feature allows us to index related documents, but at a potentially heavy cost during indexing. We concluded the chapter by going over the querying options that we have: we looked at equality and range comparisons, complex queries, the cost of prefix and postfix searches, and advanced query operations such as Contains, In, and performing nested queries using Any.
That is quite a lot to digest, and I recommend that you spend some time playing around with the indexing and querying options that RavenDB offers, to familiarize yourself with the range of tools it opens up. In the next chapter, we are going to talk about full text searching, making full use of Lucene and its capabilities. After that, we'll go on to talk about map/reduce and what we can do with it.
Chapter 9
Part III
In this part, we'll learn about scale out:

- Replication
- Sharding
- Reporting & SQL Replication
Part IV
In this part, we'll learn about operations:

- Monitoring ?
- Endpoints ?
- Troubleshooting
- Security