Inside RavenDB 3.0

Oren Eini
Hibernating Rhinos

Contents

1 Introduction
    1.1 What is this?
    1.2 Who is this for?
    1.3 In this book
        1.3.1 Part I
        1.3.2 Part II
        1.3.3 Part III
        1.3.4 Part IV
        1.3.5 Part V
    1.4 Distributed computing
        1.4.1 Fallacies

Part I

2 A little history
    2.1 The back story
        2.1.1 All the same mistakes
        2.1.2 Maybe tooling will help?
        2.1.3 Think different (database)
        2.1.4 Setting RavenDB free
    2.2 The case for a non-relational data store
        2.2.1 Normalization, compression and data corruption
        2.2.2 Unclear consistency boundaries
        2.2.3 It isn't optimized for your scenarios
        2.2.4 Change is hard, let's not do that
        2.2.5 Scale? Sure, get a bigger box
    2.3 Making sense in the NoSQL menagerie
        2.3.1 Key/Value databases
        2.3.2 Graph databases
        2.3.3 Column databases
        2.3.4 Document databases
    2.4 2nd generation document database
        2.4.1 Let me do my job, and you'll do yours

3 Zero to 60 with RavenDB, from installation to usage
    3.1 Setting up everything
    3.2 Coding with RavenDB
    3.3 The basics of the client API
        3.3.1 The document store
        3.3.2 The session
        3.3.3 Database commands
        3.3.4 Working with async
    3.4 Whatcha doin' with my data?
        3.4.1 Where is the data actually stored?
    3.5 Running in Production
    3.6 Summary

4 RavenDB concepts
    4.1 Entities, Aggregate Roots and Documents
    4.2 Collections
    4.3 Metadata
    4.4 Document Identifiers
        4.4.1 Don't use Guids
        4.4.2 Human readable identifiers
        4.4.3 High/low algorithm
        4.4.4 Identity
        4.4.5 Semantic ids
        4.4.6 Working with document ids
    4.5 ETags
    4.6 Optimistic Concurrency Control
        4.6.1 End-to-end Optimistic concurrency
    4.7 Caching
        4.7.1 HTTP Caching
        4.7.2 Aggressive Caching
    4.8 Summary

5 Advanced Client API Usage
    5.1 Lazy is the new fast
    5.2 Unbounded Results Set Prevention
    5.3 Streaming results
    5.4 Bulk inserts
    5.5 Partial document updates
        5.5.1 Simple Patch API
        5.5.2 Scripted patching
        5.5.3 Concurrency
    5.6 Listeners
    5.7 The Serialization Process
    5.8 Changes() API
    5.9 Result transformers
        5.9.1 Load Document
    5.10 Summary

Part II

6 Inside the RavenDB indexing implementation
    6.1 How indexing works in RavenDB
    6.2 Incremental indexing
    6.3 The indexing process
        6.3.1 Throughput vs. Latency
        6.3.2 Parallel indexing
        6.3.3 Auto tuning batches
        6.3.4 What gets indexed?
        6.3.5 Introducing a new index
        6.3.6 I/O Considerations
    6.4 Index safeties
    6.5 Lucene
    6.6 Transformations on the indexing function
    6.7 Error handling
    6.8 What about deletes?
    6.9 Indexing priorities
    6.10 Summary

7 The query optimizer and dynamic queries
    7.1 The Query Optimizer
        7.1.1 The life cycle of automatic indexes
        7.1.2 Index merging
        7.1.3 Dynamic index selection
    7.2 Complex dynamic queries
        7.2.1 Querying collections
        7.2.2 Querying capabilities
        7.2.3 Includes in queries
        7.2.4 Projections
        7.2.5 Result Transformers
    7.3 DocumentQuery vs. LINQ queries
        7.3.1 Property Paths
    7.4 Summary

8 Static indexes
    8.1 Creating & managing indexes
    8.2 Defining indexes in your code
    8.3 Complex queries and doing great things with indexes
        8.3.1 The many models of indexing
        8.3.2 The purity of indexes
    8.4 Multi map indexes
        8.4.1 Multi map indexes from the client perspective
    8.5 Projections
    8.6 Load Document
    8.7 The dark side of Load Document
        8.7.1 The commonly referenced & updated document
        8.7.2 Load Document costs and heuristics
    8.8 Queries
        8.8.1 Equality and comparisons
        8.8.2 Query clauses
        8.8.3 Prefix and postfix searches
        8.8.4 Contains, in and nested queries
    8.9 Summary

9 Full text search
    9.1 Inverted Index
    9.2 Terms
    9.3 Analyzers
    9.4 Stop words
    9.5 Facets
    9.6 Suggestions
    9.7 Complete search example

Part III

Part IV

Chapter 1

Introduction
RavenDB is a 2nd generation document database, part of the NoSQL approach.
That may or may not mean anything to you. If it doesn't, here is the elevator
speech. A document database is a database that stores "documents". Not Word
or Excel documents, but documents in the sense of structured information in
the form of self-contained data. Usually, a document is in JSON or XML format.
RavenDB is a database for storing and working with JSON data.

RavenDB is a 2nd-generation database because we've been able to observe what
everyone has been doing and learn from their mistakes. So RavenDB follows a
number of guidelines that result in a very different experience than is usually
associated with NoSQL databases. For example, RavenDB is an ACID database,
unlike many other NoSQL databases. Also, it was designed explicitly to be easy
to use and maintain.

I'm a developer at heart. That means that one of my favorite activities is writing
code. Writing documentation, on the other hand, is so far down the list of my
favorite activities that one could say it isn't even on the list. I do like writing
blog posts, and I've been maintaining an active blog for over a decade now.
Documentation tends to be dry and, while informative, that hardly makes for
good reading (or an interesting time writing it). RavenDB has quite a bit of
documentation that tells you how to use it, what to do and why. This book isn't
about providing documentation; we've got plenty of that. A blog post tells a
story, even if most of mine are technical stories. I like writing those, and it
appears that a large number of people also like reading them.

1.1 What is this?


This book is effectively a book-length blog post. The main idea here is that
I want to give you a way to grok1 RavenDB. Not just what it does, but how
it does it and all the reasoning behind the curtain. In effect, I want you to
understand all the whys of RavenDB.

A blog post and a book have very different structures, audiences and purposes,
however, so this can't really be just a very long blog post. But I'm aiming to
give you the same feeling. I'm not going to try for dry documentation, nor is
this meant to be a reference book. If you need either, you can read the online
RavenDB documentation.

The content of this book has evolved. It started out with the bare bones outline
of the RavenDB course we have been teaching for the past five years. It grew
with the formalization of the internal training we do for new team members. This
is a guided tour of RavenDB, largely with the aim of explaining how to use it,
but including common forays into understanding how and why RavenDB does
certain things.

By the end of this book, you're going to have a far better understanding of how
RavenDB is put together. More importantly, you'll have the knowledge and
skills to make much more efficient use of RavenDB.

1.2 Who is this for?


I've tried to make this book useful for a broad category of users. Developers
reading this book will understand how to best use RavenDB features to create
awesome applications. Architects will have the knowledge required to design and
guide large scale systems with RavenDB clusters at their core. The operations
team will know how to monitor, support and nurture your RavenDB instances
in production.

Regardless of who you are, you'll come away from this book with a greater
understanding of all the moving pieces and the ability to make RavenDB do as
you wish.

This book assumes that you have working knowledge of .NET / C#, though
RavenDB can be used from .NET, Java, Node.js, Python, Ruby and PHP. Pretty
much everything discussed in the book is applicable, even if you aren't writing
code in C#. If you are running on the JVM in particular (Java, Scala, Clojure
and Groovy), the client API is identical to the .NET one, so even the client side
knowledge is transferable.

1 Grok means "to understand so thoroughly that the observer becomes a part of the
observed - to merge, blend, intermarry, lose identity in group experience." Robert A. Heinlein,
Stranger in a Strange Land


1.3 In this book


One of my major problems in writing this book was how to structure it. There
are many things that relate to one another to such a degree that it is hard to
try to understand them in isolation. We can't talk about modeling documents
before we understand the kind of features that are available for us to work
with, for example. Because of that, I'm going to introduce things in stages.

1.3.1 Part I
Chapter 2 introduces RavenDB, non-relational document stores, and the
background story for RavenDB. Not only the technical details about what it does,
but what led to its existence, and what was so important that we had to create
a whole new database for it. If you are familiar with NoSQL databases and
their history, you can skip this chapter and come back to it later.

Chapter 3 focuses on setting up a RavenDB server from scratch, then starting
to work with the database. From there, we discuss how to use the RavenDB
Client API and what sort of infrastructure your application can use when taking
advantage of RavenDB. What RavenDB actually does with your data and
storage engine choices are also covered. Finally, we talk briefly about running
in production (which is covered in greater depth in Part IV).

Chapter 4 discusses RavenDB concepts, ranging from introductions of entities
and documents to concepts of collections and metadata. We go over document
identifiers in detail, including all the common strategies to generate a document
ID and the implications of each. From there, we discuss etags and their use in
RavenDB, including caching and optimistic concurrency control.

Chapter 5 explores advanced client-side operations. We begin by demonstrating
how we can automate common tasks via listeners and then fine-tune the
serialization process. Next is a review of result streaming and bulk inserts, as
the preferred ways to port a lot of data out of and into RavenDB efficiently.
From there, we talk about partial document updates (patching), lazy request
optimizations, change notifications and the use of result transformers.

1.3.2 Part II
Indexing

1.3.3 Part III


Modeling


1.3.4 Part IV
Operations

1.3.5 Part V
Scale out

1.4 Distributed computing


1.4.1 Fallacies

Part I
In this part, we'll learn:

- The background story of RavenDB and NoSQL in general.
- How to set up and use RavenDB.
- To understand RavenDB concepts.
- How to use the RavenDB Client API effectively.

Chapter 2

A little history
This is a technical book, but I believe origin stories are as important as
operational or design details. And so, I would like to tell you the actual story of how
RavenDB began. I'm going to be talking about myself quite a lot in this section,
but we'll resume talking about RavenDB immediately after, I promise.

You can skip this chapter and go directly to the technical content if you wish,
but I do suggest reading it at some point.

2.1 The back story


In September 2007, I was sitting at JAOO, listening to Joe Armstrong talk
about Erlang, distributed systems and nine nines reliability. Afterward, I bought
his Programming Erlang book and read it cover to cover. I thought the ideas
presented in both the talk and the book were fascinating, but just reading about
them in a book wasn't enough. Armed with just enough knowledge about Erlang
to be dangerous, I decided that I needed something bigger.

At the time, CouchDB was one of the most cited Erlang projects, so I decided
that I would go over its code and learn how production Erlang code is written.
I blogged about the experience, so you can read my raw thoughts at the time.
And so, my first real introduction to NoSQL was actually reading through the
codebase of a production quality database. It was a fascinating journey.

But I didn't (and don't) like the Erlang syntax. And I disagreed with several
aspects of the CouchDB approach. And that was how it stayed for quite some
time. Since 2004 or so, I had been dealing primarily with Object Relational
Mappers and relational databases. I'm a core team member of the NHibernate
project and I've been a consultant for much of that time.

Taken together, this means that ever since 2004, my job was largely to go to
clients and improve the performance, stability, and scalability of their applications.

The problem with that was that at some point, I made a mistake. I was doing
a code review for two clients, and I sent each of them the report about the
issues found in the other client's code and how to resolve them. Mistakes like that
aren't pleasant, but they happen. So, imagine my surprise when both clients
didn't notice that they had a report about a completely different codebase before
I pointed it out. In fact, one client was actually fast enough to implement my
suggestions before I could get back to them. Although they did comment that
I was probably working on top of a different branch :-).

2.1.1 All the same mistakes


That led me to go back and review all of my work for the past several years,
and realize that I was really doing the exact same thing, over and over again. I
was working for clients that ranged from Fortune 100 to small start-ups, and in
industries as diverse as health care, social media, risk management and real
estate management. But the problems that the clients ran into were the same.
And the solutions were often the same as well.

It got to the point where I would tell my clients that if I wasn't able to give them
a major performance boost within two days of arriving, I would cut my rate by
50%. I never had to do that, because usually within two hours of arriving at a
customer, I was able to point out one of the two or three very common issues
that would drastically improve their situation.

I was very deeply into NHibernate consulting at the time, which meant that
most of my consulting was done around database interactions or the overall
system architecture. In one memorable case, I helped a customer reduce the load
on his system from roughly 17,000 (seventeen thousand) queries per page to a
mere 75 queries per page. That is correct: they had 17 thousand queries per
page. In production. For several years. Admittedly, that is quite an extraordinary
case. But my usual target was a 75% - 90% reduction in the number of queries
that the system made, and a comparable increase in performance.

Usually I actually got to use my NHibernate knowledge to optimize the way
the system accessed the database. But that usually happened at least several
days into my visit. The first few days always consisted of finding the common
Select N+1 issues, finding the common Unbounded Result Set issues, etc. I
didn't even try to find all of those issues, because they were so prevalent that
just fixing the first three or five would usually give the system a very big boost
in performance and make the customer happy.

It made me very sad, however. I was the database consultant equivalent of a
plumber. I would go to a client, unclog the database, give some advice about
how to do better, and move on to another client to do the exact same thing.

2.1. THE BACK STORY

17

I blogged about those issues extensively, and most of the people who invited
me to look over their code were readers of my blog. By definition, then, they
weren't stupid, careless or malicious. In fact, most of them were dedicated, hard
working and very good at their jobs.

2.1.2 Maybe tooling will help?


Something had to give, because if I saw yet another Select N+1 bug, I felt that I
would just scream. So in 2008 I started working on what became the NHibernate
and Entity Framework Profilers. These tools monitor the interaction between
your application and a relational database. Based on that information, they can
show you how your code interacts with the database and what queries are being
generated by what methods. On top of that, they analyze the database
interactions and use that information to raise warnings and suggest improvements.

Building this tool meant that I could give it to my clients, and they would
have the same insights as I did about how to properly work with a relational
database. It also gave me an interesting project to work on that didn't involve
finding the same issues over and over again with slight variations on the naming
conventions.

The profilers are wonderful tools when you are working with relational databases,
but they didn't actually solve the problems that I saw over and over again. It
was now easy to pinpoint where the application was doing silly things. In essence,
that meant that we had just moved the problem; no longer having to deal with
the common errors, we now had to deal with much more complex ones. How
could we get all of that complex data to the user, manipulate it properly and
persist it in a consistent, fast and maintainable way?

The unfortunate answer was that we really couldn't. There were quite a lot of
constraints placed upon us by the choice of a relational data store. Relational
databases are wonderful things - masterpieces of engineering thought and the
culmination of decades of experience. They are also, quite often, the wrong tool
for the job.

2.1.3 Think different (database)


It was at that point that I took the very daring step of trying to set
out and actually plan what would be my ideal database. The kind of database
that wouldn't make me fight it. The kind of database that would allow me to
really just get things done, instead of writing another schema migration script.

I thought about it often enough that I actually had a whole design sketched out
in a series of blog posts. The idea just wouldn't leave me; eventually I broke
down and started writing the code. Initially, I thought it would just be a spike, or
another OSS project. I called it Rhino.DivanDB; I think you can guess what
inspired me.

The problem is that pronouncing Rhino.DivanDB is pretty hard (try saying it
out loud several times). Eventually, I realized that I was actually calling it
RivanDB. From there, it was just a matter of finding a word that was close and
made sense. In the end, this is how RavenDB was born.

Figure 2.1: Raven on rock


Since then, I've become quite fond of ravens; many of the internal components
inside RavenDB are now named after various ravens. At around the same time
that I decided on the name change, I also realized that it wouldn't be enough
for me to undertake this as only an open source project.

I had a burning desire to actually go and make this. And not just as a random
GitHub repository - I wanted to make it great. That meant that it had to be
a product. Discussing the nuances of product vs. project is out of scope here,
even in this seemingly off-topic section, but in general, it meant I had to sit
down and formulate an actual business plan. I needed to plan what I was going
to invest, how the money was going to come in, the cutoff point if it was an
utter failure, and at what point I'd allow myself to drink champagne.

2.1.4 Setting RavenDB free


In May of 2010, there was a public release of RavenDB 1.0. That was quite an
event, and I was very pleased with what we had managed to achieve. The plan
called for working on RavenDB, showing it off, and building confidence in it. I
expected this process to take 18 months, since the lead time for something as
critical as a database is usually very long. Instead, we had a system running in
production using RavenDB within four months! And the uptake since then has
been well above what I initially expected.

Oh well, I'll settle for building great databases, rather than realistic business
plans :-).

Today, the RavenDB Core Team has about 15 full time developers, and the
limiting factor is the quality we require from our people. RavenDB runs mission
critical systems from healthcare to banking, from government institutions to
real estate brokerage systems. Several books have been written about RavenDB,
articles about it show up regularly in trade journals and you can hear talks in
user groups and conferences across the world.

I'm quite happy with this situation, as you can imagine. And RavenDB is just
getting better.

All of that said, the back story might be interesting to you or not, but you
aren't here to read about the history of RavenDB. You are reading this because
you want to use RavenDB in your future. And that means that you need to
understand why you'll want to do that.

2.2 The case for a non-relational data store


Edgar F. Codd formulated the relational model in 1969. Ten years later, Oracle
2.0 came to the market. And Sybase SQL Server came out with its first version
in 1984. By the early '90s, it was clear that relational databases had pushed
the competition (such as navigational or object oriented databases) to the
sidelines. It made sense: you could do a lot more with a relational database,
and you could do it more easily, usually faster and certainly in a more convenient
manner.

Let us look at the environment those relational databases grew up in.1 In
1979, you could buy IBM's 3370 direct access storage device. It offered a
stunning 571MB (read: megabytes) of storage for the mere cost of $35,100. For
reference, the yearly salary of a programmer at that time was $17,535. In other
words, the cost of a single 571MB hard drive was as much as two full time
developers.

In 1980, we had the first GB drive, for merely $40,000, weighing 550 pounds and
about as big as a refrigerator. By 1986, the situation had improved: you could
purchase a good internal hard drive with all of 20MB for merely $800. For
reference, a good car at the time would cost you less than $7,000.

1 This is a somewhat apples & oranges comparison that I'm making here. It is hard to get
pricing on a consistent set of hard drives from those years. Some of the hard drives listed here
would be considered enterprise grade hardware, while others are consumer grade. The intent
is mostly to give you a good idea about the relative costs.


Skipping ahead again, by 1996 you could actually purchase a 2.83 GB drive for
merely $2,900. A car at that time would cost you $12,371. I could go on, but I'm
sure you get the point by now. Storage used to be expensive. So expensive
that it dominated pretty much every other concern you can think of.

At the time of this writing, you can get a 6 TB drive2 for less than $300. And
a 3 TB drive will cost you roughly $100. That is 2014 at a few cents per
gigabyte, versus 1980 at $40,000 per gigabyte.3

2 WD60EFRX - if you really care
3 Note that we are ignoring 30 years of inflation, too.

Even leaving this aside, we also have to consider the type of applications that
were written at that time. In the '80s and early '90s, the absolute height of user
interface was the master/details form. And the number of users you had could
usually be counted on one hand.

That environment produced databases that were optimized to answer the kinds
of challenges that prevailed at the time. Storage was expensive, so a major effort
was made to reduce storage as much as possible. Users' time was very cheap
by comparison, so trade-offs that meant we could save some disk space at
the expense of making the user wait were good design decisions. Time passed,
machines got faster and cheaper, disk space became cheap, and making the user
wait became unacceptable, but we are still seeing those trade-offs today.

Those are mostly evident when you look at normalization, fixed schemas and
the relational model itself.

2.2.1 Normalization, compression and data corruption


You've very likely used a relational database in the past. That means that
you've learned about normalization, and how important that is. Words like "data
integrity" are thrown around quite often. But the original purpose of
normalization had everything to do with reducing duplication to the maximum
extent.

A common example for normalization is addresses. Instead of storing a customer's
address on every order that he has, we'll simply store the address id in the
order, and we have saved ourselves the need to update the address in multiple
locations. You can see a sample of such a schema in Figure 2.2 (you can also
access the schema online).

You've seen (and probably written) such schemas before. And at a glance, you'll
probably agree that this is a reasonable way to structure a database for order
management. Now, let us explore what happens when the customer wishes to
change his address. The way the database is set up, we can just update a single
row in the Addresses table. Great, we're done.
Except we've just introduced a subtle but deadly data corruption into our
database. If that customer has existing orders, both those orders and the
customer information are all pointing at the same address. Updating the
address for the customer will therefore also update the address for all of his
orders. When we look at one of those orders, we'll not see the address it
was shipped to, but the current customer address.

Figure 2.2: A simple relational schema for orders
In the real world, I've seen such things happen with payroll systems and
paystubs (payslips across the pond). An employee got married, and changed
her bank account information to the new shared bank account. The couple also
wanted to purchase a home, so they applied for a mortgage. As part of that,
they had to submit paystubs from the past several months. The employee
requested that the HR department send her the last few stubs. When the bank
saw that there were paystubs made out to an account that didn't even exist,
they suspected fraud; the mortgage was denied and the police were called. An
unpleasant situation all around.

The common response to this issue is that it is a matter of bad modeling
decisions (and I agree). The problem is that the appropriate model would mean
that each order has its own address id in the Addresses table. That isn't really a
good idea; you'll have to do additional joins to get the data. Combine that with
a real world model of even moderate complexity and the size and cost of the
model just explodes.

2.2.2 Unclear consistency boundaries


Modeling concerns in a relational database are also complex because there is
an impedance mismatch between the way we model our business objects and
the way the relational database forces us to persist them. Because a relational
database is limited to tables, rows, and columns, we are forced to spread a single
entity across multiple tables. Let us look at the Orders table in Figure 2.2. An
Order is an entity in our system. It has its own unique existence. However, the
order lines only serve to store data about the order; they don't really have a
right to exist independently.

The problem with a relational database in this case is that the consistency
boundary it offers is the row. However, in our business model, the consistency
boundary isn't an order line. That isn't meaningful. The consistency
boundary is the Order. More generally, the consistency boundary we have is
the Aggregate Root.

That leads to a lot of contortions when using relational databases to get the kind
of consistency boundary that we need. It requires coarse grained locking, and
being very careful about how we approach changing the database, lest we forget
to lock the Aggregate Root and corrupt our data. And when we are talking about
deep object graphs, just getting to the Aggregate Root can be very expensive. It
is expensive because relational databases are optimized for exactly the wrong
thing.


2.2.3 It isn't optimized for your scenarios


If your application is a typical one, you usually see a rate of a single write for
every dozens or hundreds of reads. However, a relational database is heavily
optimized toward writes over reads. It is relatively cheap to update a row (add
a line item to an order), but it is much more expensive to read (we need to read
the order row, join to the order lines, load the products, the customer information,
etc.).

Remember when relational databases were designed and built: it made a lot of
sense to do this optimization, because making the user wait a bit wasn't a big
deal at all. The machine's time was a lot more expensive than the user's time.
Today, that is a strange decision indeed, and something that many applications
are suffering from.

2.2.4 Change is hard, let's not do that


Probably the most frustrating issue is the sheer level of friction involved
in working with a relational database. You have to define your schema up front
(usually at the time when you know the least about your project and your
needs). Any change in the system requires a lot more work, and there is no easy
way to actually manage the current state of the database in a way that works
nicely within your development cycle.

As anyone who has ever had to maintain schema change scripts knows,
deployment to production when you have a new version is a nightmare, and just
managing that in source control, so your database and your code are in sync, is a
non trivial task. This leads developers to get very creative with approaches for
avoiding schema changes. In one memorable case, the Type of a customer was
defined as an Int32. There were only 3 possible customer types, and that left 28
whole bits available for usage before they had to introduce a new column. Untangling
that several years later was an utter joy, as you can imagine.

But all of those are just stuff that makes it hard to work with relational databases.
If it wasn't for the internet and the need for web-scale solutions, we probably
wouldn't be talking about NoSQL or RavenDB.

2.2.5 Scale? Sure, get a bigger box


A hidden assumption in the relational model is that the entire dataset is available
at all times, and that there is little if any difference in the access times to
different pieces of data (ignoring memory vs. disk for now). This assumption held
true as long as we could use a single machine to hold the entirety of the dataset.
When talking about web-scale data sets, that is far from feasible.


The usual scaling method for relational databases was to buy a bigger box.
Rinse & repeat, until you run out of bigger boxes, and at that point, you are
pretty much stuck.

Since my day job is building databases, let us assume that I got the requirement
to build a relational database that would allow distribution of data among
multiple nodes. The first thing to do would be to create a table with a primary key.
We can just decide that certain key ranges would go to certain nodes, and we
can move on. That does raise the issue of what to do when a node is down. I
will not be able to read or write any rows which fall in this node's range.5
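
To make this concrete, here is a minimal C# sketch of such key-range routing; the node names and the range boundaries are invented for the illustration:

public static class KeyRangeRouter
{
    // Hypothetical sketch: route a row to a node by primary key range.
    // If the node that owns a range is down, every row in that range
    // is unreachable for both reads and writes.
    public static string GetNodeForKey(long key)
    {
        if (key < 1000000) return "node-a";  // keys below one million
        if (key < 2000000) return "node-b";  // the next million keys
        return "node-c";                     // everything else
    }
}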
We'll ignore this problem for now and try to implement the next feature, a
unique constraint. That is required so I can't have multiple users with the same
email. But this just makes things that much harder again. Now every insert
or update to a user's email will require us to talk to the other nodes. And what
happens if one of the nodes is unavailable? At this point, I don't know if I have
this email or not. I might have it, if it is located in the node that I cannot access.

We'll ignore this problem as well, and just assume that we have competent and
awesome DBAs and that nodes can never go down. What is the cost of making a
query? Well, simple queries I can probably route to a single node. But complex
queries?
Consider the following query, using the schema we have seen in Figure 2.2:

SELECT * FROM Orders o
    JOIN OrderLines ol ON ol.OrderId = o.Id
    JOIN Products p ON ol.ProductId = p.Id
    JOIN Addresses a ON o.AddressId = a.Id
    JOIN Customers c ON o.CustomerId = c.Id
WHERE o.Id = 7331

In a system with multiple nodes, how costly do you think this query is going to
be? This is a pretty simple query, but it is still likely to require us to touch
multiple nodes. Even ignoring the option of failure, this is a ridiculously expensive
way to do business. And the more data you have, the more nodes you have, the
more expensive this becomes.

As the need for actual web-scale grew, people noticed that it is not really
possible to scale out a relational database. And there is only so much scaling
up that you can do. This is where the need for alternative solutions came into
play. Thus the need for NoSQL.

5 Yes, we can avoid this by duplicating the data on multiple nodes, but that just moves the
problem around; instead of one node being down, we need two or three nodes down to have
the same effect.


2.3 Making sense in the NoSQL menagerie


One of the hardest things about NoSQL is that it is defined in negative terms.
The Windows Registry is a NoSQL database, for example. It is a database, and
it doesn't use either SQL or a relational model. In general, NoSQL databases
are non relational, can scale out more easily and are designed from the ground
up for the modern operating environment.

Typically, one counts four different types of NoSQL databases:

- Key/Value Databases
- Graph Databases
- Column Family Databases
- Document Databases

2.3.1 Key/Value databases


Key/value databases are database systems that store a key and its associated
value, as the name might have told you. Some of them allow operations on the
value (such as defining a value as a set or a list that can be manipulated server
side), but most of the time, you are just working with the plain raw value (which
is a byte array).

The good thing about them is that it is pretty easy to scale them, because they
offer a very simple API and they tend to be quite fast, since their internals
are pretty simple. Scaling out is a matter of adding more servers and updating
the way we split the data among the servers.6 The bad thing about them is
that they are very simple, so a lot of the onus of using them properly is on the
user. You cannot make queries using a key/value store; you have to maintain a
secondary index manually, for example.
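
To illustrate that last point, here is a minimal C# sketch of maintaining such a secondary index by hand. The in-memory dictionary stands in for the key/value store, and all the key formats are invented for the example:

using System.Collections.Generic;

public class ManualSecondaryIndex
{
    // Stand-in for the key/value store: both the user document and the
    // email index entry are plain key/value pairs, and the application
    // must keep them in sync itself.
    private readonly Dictionary<string, string> _store =
        new Dictionary<string, string>();

    public void PutUser(string userId, string email, string userJson)
    {
        _store["users/" + userId] = userJson;
        // The "secondary index" is just another key/value entry:
        // email -> the key of the user document.
        _store["emails/" + email] = "users/" + userId;
    }

    public string GetUserByEmail(string email)
    {
        string userKey;
        if (!_store.TryGetValue("emails/" + email, out userKey))
            return null;
        return _store[userKey];
    }
}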

2.3.2 Graph databases


Composed of nodes and edges, graph databases allow you to model your domain
using graphs. That can be very helpful when you want to deal with highly
connected data, such as social graphs, travel options or network flows. Such
databases are very useful when one of the most important aspects of the data
isn't just the data itself, but the associations between different pieces of
information.

Graph databases allow you to define nodes and associate them using edges, and
then allow queries to traverse them as needed. Some databases also offer some
graph algorithms (shortest path, Dijkstra's, etc.), but a lot of them function
mostly as a queryable adjacency list.

6 A highly simplified view, I'm aware, but good enough for our purposes right now.


A major flaw in scaling such systems is that most graphs tend to be highly
interconnected, and it is very hard to isolate an independent subgraph and break
it out onto a separate machine. Consider the classic six degrees of separation
theory, and that the average distance between any two random Twitter users is
less than four.

Because of that, the use of graph databases is usually limited to just the
associations that need to be handled. Most of the actual data is stored in another
type of database.

2.3.3 Column databases


A column database (sometimes called a column family database) is meant to
store very large datasets. The data is structured as rows, but unlike relational
databases, where the row schema is fixed, the columns are sparse and you can
have as many of them as you want. It is typical to have different rows with
different columns, and the column names aren't just metadata, they are part of
the data.

The actual storage of the data is done on a sparse column basis, which means
that aggregation of the data is very quick. For the most part, that means that
you need to pay attention to the actual structure of the data as much as to
the data itself. Column databases are good for very big datasets that need
to be distributed over many nodes. They have a high degree of operational
and development complexity, and offer little advantage if your dataset is small
enough to fit on a small number of nodes.

2.3.4 Document databases


Document databases are very similar to key/value stores. They store a document
by key. But unlike a key/value store, where the value is usually arbitrary,
in a document database the format7 of the documents is well known, usually
JSON. That means that the database server can perform operations on that
data.

Those operations include querying, partial document updates and aggregation
(among others). The major benefits of document databases are an easy model
to work with, a smooth scaling model and good support from the database
engines. In particular, document databases are very well suited for OLTP8 and
Domain Driven Design.

Since OLTP applications are what I did for a living for a long time, it is no
surprise that this is the area where RavenDB is focused.

7 But not the schema
8 Online Transaction Processing


2.4 2nd generation document database


"A guy wakes up one day and decides he wants to build a database..."

No, you don't get to hear the rest of the joke. But this is a serious question:
why build another NoSQL database? When sitting down to design what would
become RavenDB, I had a few goals in mind. At the time, I mostly dealt
with consulting for clients building line of business applications. Those ranged
from small potatoes such as "we need to track expenses" to medium size risk
management systems to monster systems running the core business processes
for Fortune 100 companies.

The core principle for most line of business applications is OLTP, which stands
for OnLine Transaction Processing. That is the major area that I set out for
RavenDB to serve. RavenDB's success means that it would be the obvious
choice for any OLTP system. And I think we have gone quite a bit toward
that destination.

RavenDB's design goals included:

- Zero friction throughout (dev & ops)
- Zero admin
- Safe by Default
- Transactional / ACID
- Easily scalable

Note that most of those design goals are actually just different ways to say the
same thing. Basically, the goal of RavenDB is that it Gets Out of Your Way and
Just Works.

2.4.1 Let me do my job, and you'll do yours


We had to do quite a lot of work in order to achieve the "It Just Works" model.
The RavenDB engine will monitor itself and (within configured limits) knows
how to auto tune itself on the fly. The Safe by Default design means that it is
harder (sadly, it is still not impossible) to shoot yourself in the foot.

But probably most importantly, we have looked at other databases, relational
and NoSQL alike, and figured out what the pain points were. What we found was
that mostly, people were struggling because it was so hard. In order to deploy
a NoSQL solution you had to become an expert on your database, and even
then, you usually had to babysit it quite often. I'm talking about everything
from installation and configuration, to looking at the data you put inside the
database, to understanding errors and monitoring the system in production.

That was doubly true when you consider that my usual ecosystem is the .NET
framework. That meant that most of my applications and systems are actually
running on Windows. And most NoSQL solutions either flat out couldn't run,
or could run, but only as alpha quality software. My goal was to create a really
good database that .NET developers could use. As it turned out, we managed
to do quite a bit more than that.
As a small example of that, here are the installation instructions for RavenDB:

- Go to the RavenDB download page.
- Download the latest stable build.
- If you've downloaded the zip archive, unzip it and double click Start.cmd.
- Alternatively, if you've downloaded the MSI file, install it and follow the
  wizard.

You now have a running RavenDB database, and you can browse to (by default)
http://localhost:8080/ to access the RavenDB Studio. The point of these
instructions isn't so much to explain how to install RavenDB; we'll touch on that
later. The point is that this is literally the smallest number of steps that we
could get down to for getting RavenDB up & running.

This is probably a good time to note that this is actually how we deploy to our
own production environment. As part of our dogfooding effort, we always run
our systems using the default configuration, to make sure that RavenDB can
optimize itself for our needs automatically.

Most people get excited when they see that RavenDB ships with a fully
functional management studio. There is no additional tool to install; just browse
to the database URL and you can start working with your data. To be honest,
even though we invested a lot of time and effort in the studio, that is quite
insulting. We've spent even more time making sure that the actual database
engine is pretty awesome, and people get hung up on the UI.
A picture is worth a thousand words, and I think that Figure 2.3 can probably
help you understand what we didn't want to have.

When I initially started looking at the NoSQL landscape, there were a lot of
really great ideas, and some projects that really picked up traction. But all of
them were focused on solving the technical details: let us create a database that
can do this and that. This resulted in expert tools. The kind that you could
do really great things with, but only if you were an expert. If you weren't an
expert, however, those kinds of tools would be worse than useless. You might
end up removing more than just your foot, trying to use them.

The inspiration behind RavenDB was simple: it doesn't have to be so hard.
RavenDB was conceived and designed primarily so we could take the essence
behind the NoSQL movement and translate that to the kind of tooling that you
can use even without spending three months of your life learning the ins and
outs of your tooling. This was done by ruthlessly finding all points of friction
and eliminating them with extreme prejudice.

Figure 2.3: 1st gen NoSQL


Being able to see all the hurdles that other projects had to deal with meant
that we were able to avoid most of them and provide a consistent, simple and
pleasant experience for our users. That was five years ago, at the time of this
writing. Since then, we have continued to push things forward.

The purpose of this book is to provide you with all the information you need to
develop, deploy and manage a RavenDB based application in your own
organization. This completes the side track into history and my reasons for starting
to develop RavenDB; now we can move on to the real juicy stuff: actually using
RavenDB. Enjoy the ride!


Chapter 3

Zero to 60 with RavenDB, from installation to usage
In this chapter, we will install RavenDB and start working with it. Before we
can get started with RavenDB, we need to have a live instance that we can work
with. There are several ways to run RavenDB:

- Development: console mode
- Production:
  - Windows Service
  - IIS
- In the cloud: RavenHQ

The console mode is very useful for development, because you can see the
incoming requests to RavenDB, and you can interact with the server directly. For
production usage, you can install RavenDB as a Windows Service or in IIS.
We'll discuss the relative merits of each option toward the end of this chapter.
For running in the cloud, you can use the RavenDB as a Service option provided
by RavenHQ. You can read more on RavenHQ in Chapter TODO; for now, we'll
focus on everything you need to get RavenDB running on your own machine.

3.1 Setting up everything


Go to the RavenDB download page, and download the latest version. At the
time of this writing, this is version 3.0. You should download the zip archive
and extract it. Then go to the Start.cmd file and double click it. This will start
RavenDB in console (debug) mode as well as open your browser to point to the
RavenDB Management Studio.1
Note the URL in the browser. By default, RavenDB will try to use
http://localhost:8080 as its endpoint. But if you have a service already
taking this port, it might select port 8081, etc. If you are not running as an
administrator, RavenDB will ask you to authorize access to the relevant HTTP
port.

The studio will ask you to create a database. Please name the database
Northwind, and press the Create button. You can ignore the bundles selection
for now; we'll discuss them at length later. Now that we have a database, go to
the Tasks tab and then to the Create Sample Data dialog, and press the Create
Sample Data button. You should see a progress bar running for a short while,
and now you have a new database, including data that we can play with.

    The Northwind database is the sample database that came with SQL
    Server; it has been used for decades as the sample database in the
    Microsoft community. We chose this database as our sample data
    because it is likely already familiar to you in its relational format.

Go to the Documents tab and then select the Products collection on the left.
You should see something similar to Figure 3.1.

Figure 3.1: The Products collection in the Northwind database


This looks remarkably similar to what you'll see in a relational database. The
data is shown in grid format, and we have the "tables" on the left. If you click
on one of the products (the link is in the leftmost column), you'll enter into
the actual document view, as shown in Figure 3.2.

1 The acronym for the studio is RDBMS.

Figure 3.2: Editing a product document in the RavenDB Studio


Now we can actually see the JSON data of the documents. But products are
pretty simple documents, pretty much a one-to-one translation from the
relational model. Let us look at a more interesting example. Let us go to orders/827;
you can use the Go To Document text box at the top of the studio to go directly
there. The content of that document is shown in Listing 3.1.

You can see several interesting things in Listing 3.1. We no longer have a simple
key/value model that matches exactly to the column values in the relational
model. We can aggregate related information into a common object, as in the
case of the ShipTo property, which has all of the shipping information.

But probably even more important is the way we are handling the line items.
In the relational schema, those were relegated to a separate table. And loading
the order's data would require us to join to that table. Here, all of the order
information, including the collection of line items, is included directly in the
document.

We'll discuss this at length when we talk about modeling in Part III, but as
you can imagine at this early stage, this capability significantly reduces both
the complexity and the cost of getting the data from the database.
Listing 3.1: The orders/827 document

{
    "Company": "companies/73",
    "Employee": "employees/7",
    "OrderedAt": "1998-05-06T00:00:00.0000000",
    "RequireAt": "1998-06-03T00:00:00.0000000",
    "ShippedAt": null,
    "ShipTo": {
        "Line1": "Vinbæltet 34",
        "Line2": null,
        "City": "Kobenhavn",
        "Region": null,
        "PostalCode": "1734",
        "Country": "Denmark"
    },
    "ShipVia": "shippers/2",
    "Freight": 18.44,
    "Lines": [
        {
            "Product": "products/16",
            "ProductName": "Pavlova",
            "PricePerUnit": 17.45,
            "Quantity": 14,
            "Discount": 0.05
        }
    ]
}
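
For reference, a matching set of C# classes for this document might look like the following sketch. The shapes simply mirror the JSON above; these are illustrative, not necessarily the official sample data classes:

public class Order
{
    public string Id { get; set; }
    public string Company { get; set; }
    public string Employee { get; set; }
    public DateTime OrderedAt { get; set; }
    public DateTime RequireAt { get; set; }
    public DateTime? ShippedAt { get; set; }
    public Address ShipTo { get; set; }
    public string ShipVia { get; set; }
    public decimal Freight { get; set; }
    public List<OrderLine> Lines { get; set; }
}

public class OrderLine
{
    public string Product { get; set; }
    public string ProductName { get; set; }
    public decimal PricePerUnit { get; set; }
    public int Quantity { get; set; }
    public decimal Discount { get; set; }
}

public class Address
{
    public string Line1 { get; set; }
    public string Line2 { get; set; }
    public string City { get; set; }
    public string Region { get; set; }
    public string PostalCode { get; set; }
    public string Country { get; set; }
}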
We'll not be going through all the things that you can do with the studio.
Instead, we'll refer back to it whenever we want to show you something
new or interesting that has relevance in the studio as well (the operational
monitoring capabilities, visualization of work, etc.).

Now that we have a running system, feel free to explore it a bit, and then we'll
move to the fun part: using RavenDB in our application.

3.2 Coding with RavenDB


Start Visual Studio and create a new Console Application project named
Northwind. Then, in the Package Manager Console, issue the following
command:

Install-Package RavenDB.Client

This command uses NuGet to get the RavenDB Client package and add a
reference to it to your project. Now we just need to tell the client where the
server is located. Add a using statement for Raven.Client.Document, and then
create a document store, like so:

var documentStore = new DocumentStore
{
    Url = "http://localhost:8080",
    DefaultDatabase = "Northwind"
};
documentStore.Initialize();

Note that if your RavenDB server is running on a different port,
you'll need to change the document store's URL.

The document store is the starting point for all your interactions with RavenDB.
If you have used NHibernate in the past, the DocumentStore is very similar to
the SessionFactory. We use the document store to create sessions, which is how
we usually read and write data from RavenDB.

using (var session = documentStore.OpenSession())
{
    var p = session.Load<dynamic>("products/1");
    Console.WriteLine(p.Name);
}
You can see that we didn't have to define anything; we can immediately start
working with RavenDB. The schemaless nature of RavenDB, combined with the
dynamic option in C#, allows us to work in a completely dynamic world. But
for most things, we actually do want some structure. Our next step would be to
introduce the model classes to our project. In the studio, go to the Tasks tab,
then to the Create Sample Data dialog. Press the Show Sample Data Classes
button, and copy the resulting text to Visual Studio. Listing 3.2 shows the
Product sample data class.

Listing 3.2: The sample Product class

public class Product
{
    public string Id { get; set; }
    public string Name { get; set; }
    public string Supplier { get; set; }
    public string Category { get; set; }
    public string QuantityPerUnit { get; set; }
    public decimal PricePerUser { get; set; }
    public int UnitsInStock { get; set; }
    public int UnitsOnOrder { get; set; }
    public bool Discontinued { get; set; }
    public int ReorderLevel { get; set; }
}


You can see that there really isn't anything special about this class. There is no
special base class, no attributes, not even a requirement that the class have
virtual members. This is a Plain Old C# Object in its purest form. How do
we use this class? Here is the same code as before, but using the Product class
instead of dynamic.
using (var session = documentStore.OpenSession())
{
    var p = session.Load<Product>("products/1");
    Console.WriteLine(p.Name);
}
We load the product by id, then print out its name. It Just Works.

3.3 The basics of the client API


So far, we have set up RavenDB, explored the studio, written some code to connect
to RavenDB and pull data out, and defined strongly typed classes that allow
us to work with RavenDB more easily. This is all well and good, but as fun as
blind experimentation is, we need to understand what is going on in order to
do great things with RavenDB.

3.3.1 The document store


You've already used the document store to talk to RavenDB, but what is its
purpose? The document store holds the RavenDB URL, the default database
we'll talk to and the credentials that will be used. It is the first thing that we
create when we need to talk to RavenDB. But its importance extends beyond
just knowing who to talk to.
The document store holds all the client side configuration for RavenDB: how we
are going to serialize your entities, how to handle failure scenarios, what sort of
caching strategy to use, and much more. In a typical application, you shall have
a single document store instance per application (a singleton). Because of that,
the document store is thread safe, and a typical initialization pattern looks like
Listing 3.3.
Listing 3.3: Common pattern for initialization of the DocumentStore
public class DocumentStoreHolder
{
    private readonly static Lazy<IDocumentStore> _store =
        new Lazy<IDocumentStore>(CreateDocumentStore);

    private static IDocumentStore CreateDocumentStore()
    {
        var documentStore = new DocumentStore
        {
            Url = "http://localhost:8080",
            DefaultDatabase = "Northwind",
        };
        documentStore.Initialize();
        return documentStore;
    }

    public static IDocumentStore Store
    {
        get { return _store.Value; }
    }
}
The use of Lazy ensures that the document store is only created once, without
having to worry about double locking or explicit thread safety issues. And
we can configure the document store as we see fit. The rest of the code has
access to the document store using DocumentStoreHolder.Store. That should
be relatively rare, since apart from configuring the document store, the majority
of the work is done using the session. But before we get to that, let us see what
sort of configuration we can do with the document store.
3.3.1.1 Conventions
The RavenDB Client API, just like the rest of RavenDB, aims to Just Work.
As a result, it is based around the notion of conventions: a series of
policy decisions that have already been made for you. Those range from deciding
which property holds the document id to how the entity should be serialized to
a document.
For the most part, we expect that you'll not have to touch the conventions. A
lot of thought and effort has gone into ensuring that you'll have no need to do
that. But there is simply no way that we can foresee the future, or answer every
need, which is why pretty much every part of the client API is customizable.
Most of that is handled via the DocumentStore.Conventions property, by registering
your own behavior. For example, by default the RavenDB Client API
will use a property named Id (case sensitive) to store the document id. But
there are users who want to use the entity name as part of the property name.
So we'll have OrderId for orders, ProductId for products, etc.²

². I'll leave aside Id vs. ID, since it is handled in the same manner.
Here is how we can tell the RavenDB Client API that it should use this behavior:

documentStore.Conventions.FindIdentityProperty =
    prop => prop.Name == prop.DeclaringType.Name + "Id";
I'm not going to go over each option in the conventions, since there are literally
dozens of them. There are API comments on each of the exposed options, and
it is probably worth your time to go and peruse them, even if for the
most part they aren't really something that you'll touch.
Other options that can be controlled via the document store are the request timeout,
caching configuration, creating indexes and transformers, setting up listeners,
listening to changes and doing bulk inserts into RavenDB. We'll cover those
further along in this book.
3.3.1.2 Connection strings
You might have noticed that when we defined the document store so far, we
have done so using a hard coded URL and database, like so:
var documentStore = new DocumentStore
{
    Url = "http://localhost:8080",
    DefaultDatabase = "Northwind",
};
This is great when we just want to play around, but it isn't really suitable for
actually working with RavenDB on a day to day basis. Different environments
have different URLs, databases and credentials. Here, RavenDB makes no
attempt to reinvent the wheel, and it uses connection strings to specify all the
details. The following code snippet shows how to configure RavenDB using a
connection string:
var documentStore = new DocumentStore
{
    ConnectionStringName = "RavenDB"
};
This code instructs the document store to go and look at the <connectionStrings>
element in the web.config (or app.config) file. Listing 3.4 shows a sample of a few
such connection strings:
Listing 3.4: RavenDB Connection Strings
<connectionStrings>
  <add name="RavenDB" connectionString="
      Url=http://localhost:8080;
      Database=Northwind;ApiKey=MyApp/1hjd14hdfs"/>
  <add name="Another" connectionString="
      Url=http://localhost:8080;
      User=beam;Password=up;Database=Scotty"/>
  <add name="Embedded" connectionString="DataDir=~\Northwind"/>
</connectionStrings>
In this manner, you can modify which server and database your client application
will talk to just by modifying the configuration. You might also have
noticed that we have an embedded connection string as well. What is that?
3.3.1.3 Document store types
RavenDB can run in several modes. The most obvious one is when you run it as
a console application and communicate with it over the network. In production,
you do pretty much the same thing, except that you'll run RavenDB in IIS
or as a Windows Service. This is great for building server applications, where
you want independent access to your database, but there are other options with
RavenDB.
You can run RavenDB as part of your application, embedded inside your own
process. If you want to do that, just use the EmbeddableDocumentStore class
instead of DocumentStore. You can even configure the EmbeddableDocumentStore
to talk to a remote server or an embedded database just by changing the
connection string. The main advantage of using an embedded RavenDB
instance is that you don't need separate deployment or administration. There
is also no need to traverse the network to access the data, since it lives inside
the same process as your own application.
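Here is a minimal sketch of the embedded mode, assuming the RavenDB.Embedded package is installed and reusing the Product class from earlier (the data directory value is just an example):

using Raven.Client.Embedded;

// Runs the database inside our own process; no separate server needed.
var documentStore = new EmbeddableDocumentStore
{
    DataDirectory = @"~\Northwind" // example path, resolved relative to the app base
};
documentStore.Initialize();

using (var session = documentStore.OpenSession())
{
    // Exactly the same session API as in the remote mode.
    var product = session.Load<Product>("products/1");
    Console.WriteLine(product.Name);
}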
This option is particularly attractive for teams building low overhead systems
or business applications that are deployed client side. Octopus Deploy is an
automated deployment system that makes use of RavenDB in just such a manner.
Even if you use it, you're probably not aware that it is using RavenDB behind
the scenes, since that is all internal to the application.
On the other side, you have NServiceBus, which also makes heavy use of
RavenDB, but usually does so in server mode. So you'll install RavenDB as
part of your NServiceBus deployment and manage it as an independent service.
From a coding perspective, there is very little difference between the two. In fact,
even in embedded mode, you are going through the exact same code paths you'll
be going through when talking to a remote database, except that there is no
networking involved.
3.3.1.4 Authentication
A database holds a lot of information, and usually it is pretty important that
you have control over who can access that information and what they can do
with it. RavenDB fully supports this notion.
In development mode, you'll usually work with the Raven/AnonymousAccess
setting set to Admin. In other words, any access to the database will be considered
an access by an administrator. This reduces the number of things
that you have to do upfront. But as easy as that is for development, for
production we set that setting to None. This option requires all access to the
database server³ to be done only by authenticated users.
Users can be authenticated in one of two ways: either using Windows Authentication
via user/pass/domain credentials, or using OAuth via an API Key⁴. You
can configure access to RavenDB by going to the Databases page in the Studio
(click on the link at the top right corner), then selecting the System Database
(on the right of the page). Now go to the Settings tab.
Here you can see the options for configuring access. In large organizations, it
is very common to want to run all authentication through Active Directory,
because that gives the operations team centralized control over all the users. You
can select Windows Authentication and then add a new user or group, and grant
them access to specific databases. You can also grant access to all databases
by using an asterisk (*) as the database name. The asterisk does not include
access to the system database; you need to configure that independently.
Even though using Windows Authentication is quite common, I really like using
API Keys instead. You can see how we configure API Keys in Figure 3.3. I've
had quite a few issues with relying on Active Directory, too many to really feel
comfortable with it. Mostly because of interesting policies that the organizations
defined: from password expiration every 3 months that takes down systems whose
configurations haven't been updated, to deleting inactive user accounts that takes
down systems, to... I'm sure you get the point.
The good thing about API Keys is that they are not users. They are not tied to
a specific person or need to be managed as such. Instead, they represent specific
access that was granted to the database for a particular reason. In Figure 3.3
you can see that read/write access was granted to the Northwind database for
the OrdersApp API Key. I find that this is a much more natural way to handle
authentication.
Regardless of the authentication method chosen, you can control the
level of access to a database by granting read-only, read/write or admin permissions.
Read-only and read/write are quite obvious, but what does it mean to
have admin privileges on a database?
³. Note that this is a server level option, rather than a database level option.
⁴. You can think about this as Windows Authentication and SQL Authentication in SQL Server.


Figure 3.3: Configuring API Keys access



3.3.1.5 Administrators
Being an administrator means that you can perform operational actions such
as performing a backup, stopping/resuming indexing, killing queries or background tasks,
compacting the database or forcing index optimizations. Those are all admin actions
at the single database level. A user or API Key can be granted an admin level on
a single database, or on all of them (by specifying an asterisk as the database name).
That does not make them the server administrators.
The server administrators are anyone in the Administrators group for the domain
or for the local machine RavenDB is running on. In addition to those, the
user account that is running RavenDB also has server administrator permissions
for RavenDB. This last is done so you can run RavenDB under a least privilege
account and still do administration work on the server, without requiring you to
be a system administrator.
You can also configure specific users or API Keys as server administrators.
That can be done by granting them admin permission on the system database. A
server administrator can create or delete databases, change database settings,
force garbage collection, collect stats from all databases and in general watch
over everything that happens.
It is possible to define a server administrator that can manage the server, but has
no access to the actual databases on that server. This is usually used in scenarios
where regulatory compliance forbids even the administrators from being able to
access the data on the server. Usually in such scenarios the Encryption bundle
is also used, but that will be discussed later in the book. (TODO: reference)
We have gone over all the major aspects of the document store, but we have
neglected one small detail. The reason that the document store even exists is that
we use it to create sessions, which actually interact with RavenDB. It's a good
thing that this is what our next section is talking about.

3.3.2 The session


The session (formally known as the document session, but we usually shorten it to
just a session) is the primary way that your code interacts with RavenDB. If
you are familiar with NHibernate or Entity Framework, you should feel right at
home. The RavenDB session was explicitly modeled to make it easy to work
with.
Terminology
We tend to use the term document to refer both to the actual documents on the server, and to manipulating them client side. It is
common to say: load that document and then... But occasionally we need to be more precise. We make a distinction between
a document and an entity (or aggregate root). A document is the
server side representation, while an entity is the client side equivalent. An entity is the deserialized document that you work with
client side, and save back to the database to become an updated
document server side.
Let us start with the basics: writing and reading data with RavenDB. We create
a session via the document store, and we can load a document using the surprisingly
named method Load. We have already seen that we can use dynamic
or one of our entity classes there. But how about modifying data? Take a look
at Listing 3.5.
Listing 3.5: Creating and modifying data in RavenDB
1   // creating a new product
2   string productId;
3   using (var session = documentStore.OpenSession())
4   {
5       var product = new Product
6       {
7           Category = "Awesome",
8           Name = "RavenDB",
9           Supplier = "Hibernating Rhinos",
10      };
11      session.Store(product);
12      productId = product.Id;
13      session.SaveChanges();
14  }
15
16  // loading & modifying the product
17  using (var session = documentStore.OpenSession())
18  {
19      var p = session.Load<Product>(productId);
20      p.ReorderLevel++;
21      session.SaveChanges();
22  }
There are several interesting things in Listing 3.5. Look at the Store() call on
line 11: immediately after that call, we can already access the document id,
even though we didn't save the change to the database yet. Next, on line 19, we
load the entity from the document, update the entity and call SaveChanges().
The session is smart enough to understand that the entity has changed and update
the matching document on the server side. You don't have to call an Update()
method, or anything of this sort. The session keeps track of all the entities you
have loaded, and when you call SaveChanges(), all the changes to those entities
are sent to the database in a single remote call.



Budgeting remote calls
Probably the easiest way to kill your application's performance is to
make a lot of remote calls. And the common culprit is the database.
It is common to see applications making tens of calls to the database,
usually for no good reason. In RavenDB, we have done several things
to mitigate that problem. The most important among them is to
allocate a budget for every session. Typically a session would encompass
a single operation in your system; an HTTP request or
the processing of a single message is usually the lifespan of a session.
A session is limited by default to a maximum of 30 calls to
the server. If you try to make more than 30 calls to the server, an
exception is thrown. This serves as an early warning that your code
is generating too much load on the system, and as a Circuit Breaker⁵.
You can increase the budget, of course, but just having it in place
ensures that you will think about the number of remote calls that
you are making.
The limited number of calls allowed per session also means that
RavenDB has a lot of options to reduce the number of calls. When
you call SaveChanges(), we don't need to make a separate call per
changed entity; we can go to the database once. In the same manner,
we also allow batching read calls. We'll discuss that in the next chapter,
in the section on Lazy operations.

⁵. See Release It!, a wonderful book that heavily influenced the RavenDB design.
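If a specific session legitimately needs more calls, the budget can be raised explicitly. A quick sketch (50 is an arbitrary example value):

// Raise the per-session remote call budget for this session only.
session.Advanced.MaxNumberOfRequestsPerSession = 50;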
One of the main design forces behind RavenDB was the idea that it should Just
Work, and the client API reflects that principle. If you look at the surface API
for the session, there are the following high level options:

Load()
Include()
Delete()
Query()
Store()
SaveChanges()
Advanced

Those are the most common operations that you'll run into on a day to day
basis, and more options are available through the Advanced property.
3.3.2.1 Load
As the name implies, this gives you the option of loading a document or a set
of documents into the session. A document loaded into the session is managed
by the session; any changes made to the document will be persisted to the
database when you call SaveChanges. A document can only be loaded once in
a session. Let's look at the following code:
var p1 = session.Load<Product>("products/1");
var p2 = session.Load<Product>("products/1");
Assert.True(Object.ReferenceEquals(p1, p2));
Even though we call Load<Product>("products/1") twice, there is only a single
remote call to the server, and only a single instance of the Product class. Whenever
a document is loaded, it is added to an internal dictionary that the session
manages. Whenever you load a document, the session checks that dictionary
to see if the document is already there, and if so, it returns the existing
instance immediately. This helps avoid aliasing issues and also generally helps
performance.
For those of you who deal in patterns, the session implements the Unit of
Work and Identity Map patterns. This is most obvious when talking about the
Load operation, but it also applies to Query and Delete.
Load can also be used to read more than a single document at a time. For
example, if I wanted three documents, I could use:
Product[] products = session.Load<Product>(
    "products/1",
    "products/2",
    "products/3"
);
This will result in an array with all three documents in it, retrieved in a single
remote call from the server. The positions in the array match the positions of
the ids given to the Load call.
You can even load documents belonging to multiple types in a single call, like
so:
object[] items = session.Load<object>(
    "products/1",
    "categories/2"
);
Product p = (Product)items[0];
Category c = (Category)items[1];
A missing document will result in null being returned, both when loading a single
document and when loading multiple documents (a null will be returned in that
id's position). The session will remember that it couldn't load that document,
and even if asked again, it will immediately return null rather than attempt to
load the document again.
RavenDB ids are usually of the form collection + "/" + number. This
makes them very easy to look at and debug, but it does make them somewhat of a
hassle to work with in web scenarios. That is especially true when dealing with
web routes. Because of that, there is a simple convention that lets you use just
the number as the document id.
var product = session.Load<Product>(1);
This code is nice to use in web scenarios, because you can easily get
an integer id from the web framework. It relies on a convention that
matches the Product class and the numeric id and generates the final document
key: products/1. You can modify this convention via the
store.Conventions.FindIdValuePartForValueTypeConversion property.
Even though you can load the document using just the numeric part, the actual
document id is the full products/1. This is merely a convenience feature; it
doesn't change the way ids are handled.
3.3.2.2 Include
I previously mentioned that there is a budget for the number of remote calls
that you can make from a session. Include is one of the chief ways to reduce the
number of remote calls you are making. Say we want to print a product; we can do
it using the following code:
var product = session.Load<Product>("products/1");
Console.WriteLine("{0}: {1}", product.Id, product.Name);
Console.WriteLine("Category: {0}", product.Category);
This code will have the following output:
products/1: Chai
Category: categories/1
I think you can agree that this isn't a user friendly output. What we want
is to print the category name and its description. In order to do that, we need
to load it as well, like so:
var product = session.Load<Product>("products/1");
var cat = session.Load<Category>(product.Category);
Console.WriteLine("{0}: {1}", product.Id, product.Name);
Console.WriteLine("Category: {0}, {1}",
    cat.Name, cat.Description);
Which gives us the much nicer output:

products/1: Chai
Category: Beverages, Soft drinks, coffees, teas, beers, and ales
This results in the right output, but we have to go to the server twice. That
seems unnecessary. We cannot use the Load overload that accepts multiple
ids, because we don't know ahead of time what the value of the Category will
be. What we can do is ask RavenDB to help us. We'll change the first line of
code to be:
var product = session.Include<Product>(x => x.Category)
    .Load("products/1");
The rest of the code will remain unchanged. This single change has a profound
effect on the way the system behaves, because it tells the RavenDB server to do
the following:

Find a document with the key: products/1.
Read its Category property value.
Find a document with that key.
Send both documents back to the client.

RavenDB can do that because the reply to a Load request has two channels to
it: one channel for the actual results (the products/1 document) and another for
all the includes (the categories/1 document).
The session knows how to read this included information and stores it separately.
When the Load<Category>("categories/1") call is made, we can retrieve that
data directly from the session cache, without having to go to the server. This
can save us quite a bit on the number of remote calls we make.
Includes aren't joins
It is tempting to think about Includes in RavenDB as similar to a
join in a relational database. And there are similarities, but there are
fundamental differences. A join will modify the shape of the output;
it combines each matching row from one side with each matching row
on the other, sometimes creating Cartesian Products that can
cause night sweats for DBAs.
And the more complex your model, the more joins you'll have, the
wider your result sets become, and the slower your application will become.
In RavenDB, there is very little cost to adding includes. That
is because they operate on a different channel than the results of the
operation.
Includes are also important in queries, and there they operate after
paging has been applied, instead of before paging like joins do.
The end result is that Includes don't modify the shape of the output,
don't have a high cost when you use multiple includes and don't
suffer from problems like Cartesian Products.
You can also use multiple includes in a single call, like so:
var product = session.Include<Product>(x => x.Category)
    .Include(x => x.Supplier)
    .Load("products/1");
This will load both the category and the supplier documents into the session,
in one shot. And more complex scenarios are also possible. Here is one such
example:
var order = session.Include<Order>(x => x.Company)
    .Include(x => x.Employee)
    .Include(x => x.Lines.Select(l => l.Product))
    .Load("orders/1");
This code will, in a single remote call, load the order, include the company
and employee documents, and also load all the products in all the lines in the
order. As you can see, this is a pretty powerful feature.
As powerful as it is, one of the most common issues that we run into with
RavenDB is people coming to RavenDB with a relational mindset, trying
to use RavenDB as if it were a relational database and modeling their entities
accordingly. Includes can help push you that way, because they let you get
associated documents easily.
We'll talk about modeling in a lot more depth in the Document based modeling
chapter, when you have learned enough about the kind of environment that
RavenDB offers to make sense of the choices we make. For now, I'll point out
that RavenDB does not support tertiary includes. That is, there is no way in
RavenDB⁶ to load an order, its associated products through the order's lines,
and those products' categories as well.

⁶. Not quite true; you can use a transformer to do that (see Chapter 5), but it isn't recommended.
3.3.2.3 Delete
Deleting a document is done through the confusingly named method Delete.
This method can accept either an entity instance or a document id. The following are
various ways to delete a document:
var cat = session.Load<Category>("categories/1");
session.Delete(cat);

session.Delete<Product>(1);

session.Delete("orders/1");
It is important to note that calling Delete doesn't actually delete the document.
It merely marks the document as deleted in the session. It is only when
SaveChanges is called that the document will be deleted.
3.3.2.4 Query
Querying is a large part of what a database does. Not surprisingly, queries
relate strongly to indexes, and we'll talk about those extensively in Chapters 5
and 6. In the meantime, let us see how we can query using RavenDB.
List<Order> orders = (
    from o in session.Query<Order>()
    where o.Company == "companies/1"
    select o
    ).ToList();
RavenDB is taking full advantage of Linq support in C#. This allows us to
express very natural queries on top of RavenDB in a strongly typed and safe
manner.
Because we'll dedicate quite a bit of time to talking about queries and indexes
later on, I'll be brief. Queries allow us to load documents that match a particular
predicate. Like documents loaded via the Load call, documents loaded
via a Query are managed by the session. Modifying them and calling
SaveChanges will result in their update on the server.
And like the Load call, Query also supports include:
List<Order> orders = (
    from o in session.Query<Order>()
        .Include(x => x.Company)
    where o.Company == "companies/1"
    select o
    ).ToList();
You can now call Load<Company> on those companies and they will be served
directly from the session cache.
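For example, assuming the Company class from the sample data, the following load will not generate another remote call:

// Already fetched by the Include during the query; served from the session cache.
var company = session.Load<Company>("companies/1");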
Queries in RavenDB don't behave like queries in a relational database. RavenDB
does not allow computation during queries, and it doesn't have problems with
table scans. We'll touch on exactly why, and go into the details of indexing, in
Chapters 5 and 6; but for now, you can see that most queries will just work for
you.



3.3.2.5 Store
The Store command is how you associate an entity with the session. Usually,
this is done because you want to create a new document. We have already seen
that in Listing 3.5, but here is the relevant part:
Product p = new Product { ... };
session.Store(p);
string productId = p.Id;
Like the Delete command, Store will only actually save the document to the
database when SaveChanges is called. However, it will give the new entity an
id immediately, so you can refer to it in other documents that you'll save in the
same batch. We'll discuss id generation strategies in the next chapter.
Beyond saving a new entity, Store is also used to associate entities from existing
documents with the session. This is common in web applications. You have one
endpoint that sends the entity to the user, who modifies that entity and then
sends it back to your web application. You have a live entity instance, but it is
not loaded by a session or tracked by it. At that point, you can call Store on
that entity, and because it doesn't have a null document id, it will be treated as
an existing document and overwrite the previous version on the database side.
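As a sketch, a hypothetical ASP.NET Web API action doing exactly that might look like this (the Put action and its wiring are illustrative, not part of RavenDB):

// The product arrives from the client with its Id already populated,
// so Store attaches it to the session as an existing document.
public void Put(Product product)
{
    using (var session = documentStore.OpenSession())
    {
        session.Store(product);
        session.SaveChanges(); // overwrites the previous version on the server
    }
}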
Store can also be used in optimistic concurrency scenarios, but we'll talk about
that in more detail in the next chapter.
3.3.2.6 SaveChanges
The SaveChanges call will check the session state for all deletions and changes,
and send all of those to the server as a single remote call that will complete
transactionally. In other words, either all the changes are saved as a single unit,
or none of them are.
Remember that the session has an internal map of all the loaded entities. When
you call SaveChanges, those loaded entities are checked against the entity as it
was when it was loaded from the database. If there are any changes, that entity
will be saved to the database.
It is important to understand that any change will force the entire entity to
be saved. We don't attempt to make partial document updates in SaveChanges.
An entity is always saved to a document as a single full change.
The typical way one would work with the session is:
using (var session = documentStore.OpenSession())
{
    // do some work with the session
    session.SaveChanges();
}
So SaveChanges is called only once per session. In web scenarios, this is typically
handled in the controller. Listing 3.6 shows examples of base RavenDB
controllers for ASP.NET Web API and ASP.NET MVC. Both samples show
a common pattern for working with RavenDB: we have the infrastructure (in
this case, the base controller) take care of opening the session for us, as well as
calling the SaveChanges method if there have been no errors.
Listing 3.6: Base controller classes for RavenDB
public abstract class BaseRavenDBController : Controller
{
    public IDocumentSession DocumentSession { get; set; }

    protected override void OnActionExecuting(
        ActionExecutingContext filterContext)
    {
        DocumentSession = DocumentStoreHolder.Store
            .OpenSession();
    }

    protected override void OnActionExecuted(
        ActionExecutedContext filterContext)
    {
        using (DocumentSession)
        {
            if (DocumentSession == null ||
                filterContext.Exception != null)
                return;
            DocumentSession.SaveChanges();
        }
    }
}

public abstract class BaseRavenDBApiController : ApiController
{
    public IAsyncDocumentSession DocumentSession { get; set; }

    public override async Task<HttpResponseMessage> ExecuteAsync(
        HttpControllerContext ctx,
        CancellationToken cancel)
    {
        using (var session = DocumentStoreHolder.Store.OpenAsyncSession())
        {
            var message = await base.ExecuteAsync(ctx, cancel);
            await session.SaveChangesAsync();
            return message;
        }
    }
}
This tends to greatly simplify how we actually work with the database,
because we don't have to remember to call SaveChanges or manage the session
ourselves. We also make sure that we have exactly one session for the duration
of the request, which is also a good practice. In addition, this makes it very
easy to test code that uses RavenDB. We'll talk about unit testing RavenDB
code in Chapter 9.
With this, we conclude the public surface area of the session. Those methods
allow us to do about 90% of everything you could wish for with RavenDB.
For the other 10%, we need to look at the Advanced property.
3.3.2.7 Advanced
The surface area of the session was quite carefully designed so that the common
operations would be just a method call away, and so that there would
be few of them. But while this covers the most common scenarios, that
isn't enough for a high quality product like RavenDB.
All of the extra options are hiding inside the Advanced property. You can use
it to configure the behavior of optimistic concurrency via:
session.Advanced.UseOptimisticConcurrency = true;
Force a reload of an entity from the database, to get the changes that have been made
to it since it was loaded:
session.Advanced.Refresh(product);
You can make the session forget about an entity:
session.Advanced.Evict(product);
I'm not going to go over the Advanced options here. There are quite a few of them,
and they are quite literally documented on the API itself. We'll touch on the relevant
parts during the rest of the book. But it is still worth your time to inspect what
else is there, even if you will rarely have to use it.

3.3.3 Database commands


The session is a high level interface to RavenDB. It has the identity map, it
has Linq queries, and it does pretty much everything so you won't have to deal
with the low level stuff on a regular basis. But when you do, this is where the
database commands come into play.
RavenDB is exposed over the network using a REST API, and you can absolutely
make use of REST calls directly. We have several customers that are
using REST calls from PowerShell to administer RavenDB. That is fine, and
works great, but usually we can do better.
The Database Commands expose a low level API against RavenDB that is much
nicer than raw REST calls. For example, I might want to check if a potentially
large document exists, without loading it. I can do that using:
var cmds = DocumentStoreHolder.Store.DatabaseCommands;
var docMetadata = cmds.Head("products/1");
if (docMetadata != null)
    Console.WriteLine("document exists");
You can use the Database Commands to get the database statistics, generate
identity values, get the indexes and transformers on the server, issue patch
commands, etc.
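A couple of quick sketches of what those look like; both calls below live on the same DatabaseCommands interface (the collection name is just an example):

// Database-wide statistics: document counts, index states, etc.
var stats = cmds.GetStatistics();

// Ask the server for the next identity value for a given prefix.
long nextId = cmds.NextIdentityFor("products");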
The reason that they are exposed to you is that the RavenDB API, at all levels,
is built with the notion of layers. The expectation is that you'll usually work
with the highest layer, the session API. But since we can't predict all things,
we also provide access to the lower level API, on top of which the session API
is built, so you can drop down to it if you need to.
For the most part, that is very rarely needed, but it is good to know that this
is available, just in case.

3.3.4 Working with async


So far, we have shown only synchronous work with the client API. But async
support is crucial for high performance applications, and RavenDB has full
support for it. The async API is exposed via two major endpoints: the async
session and the async database commands.
In all respects, they are identical to the sync versions. Listing 3.7 shows how
we can save and load a document using the async session.
Listing 3.7: Working with the async session
string productId;
using (var session = DocumentStoreHolder.Store.OpenAsyncSession())
{
    var product = new Product { Name = "Async Goodness" };
    await session.StoreAsync(product);
    productId = product.Id;
    await session.SaveChangesAsync();
}
using (var session = DocumentStoreHolder.Store.OpenAsyncSession())
{
    var product = await session.LoadAsync<Product>(productId);
    Console.WriteLine(product.Name);
}
Except for the await unary operators and the Async affixes, this is pretty much
the same thing as the sync version. When querying using the async session, you
use the ToListAsync method.
var products = await session.Query<Product>()
    .Where(x => x.Name == "Async Goodness")
    .ToListAsync();
RavenDB splits the sync and async APIs because their use cases are quite different,
and having separate APIs prevents you from doing some operations synchronously
and some asynchronously. Because of that, you cannot mix the synchronous
session with async calls, or vice versa. You can use either mode in your
application, depending on the environment you are using. Aside from the minor
required API changes, they are completely identical.
The async support is very deep, going all the way down to the I/O issued to the
server. In fact, the synchronous API is built on top of the async API and async I/O.
We covered the basics of working with the RavenDB Client API in this section,
but that was mostly the mechanics. We'll dive deeper into using RavenDB in
the next chapter, where we'll also learn about the details and how it is all put
together.

3.4 Whatcha doin with my data?


When you put data into RavenDB, where does it go? Since RavenDB is
transactional and ACID, the disk must be involved, but what exactly is going
on there?
Server side, RavenDB is built of layers, just like the client API. The data goes
into RavenDB and eventually reaches the storage engine layer. This storage
engine is responsible for transactions, safety and isolation. It is where
RavenDB's safety originates.
RavenDB ships with two storage engines: Esent and Voron. Both of them are
ACID (Atomic, Consistent, Isolated and Durable), support snapshot isolation
and have the kind of high performance RavenDB needs to deliver top notch
service to your systems.


Esent stands for the Extensible Storage Engine (ESE), and it is also known as
JET Blue. It is a core component in Windows, and forms the basis for services
such as Active Directory, Exchange and many other Windows components. It is
a robust and production tested storage engine, and has been used in RavenDB
since the very start.
Voron (Russian for Raven) is an independently developed storage engine that
was created by Hibernating Rhinos. It takes a lot from LevelDB and LMDB,
but its internal structure is quite different. It provides high performance reads
and writes and has full ACID support. Voron is our next generation storage
technology, and it lies at the core of several upcoming features for RavenDB.
When you create a new database with RavenDB, you have the option of selecting
the storage engine. Before RavenDB 3.0, you only had Esent as the storage
engine; now you have to make a choice. Esent has more time in the field, and in
general has proven to be a pretty good choice. It does suffer from several
issues, chief among them that you cannot easily move the database between
machines. That is because Esent is tied to the Windows version, so you can't
take a database from a Windows 2012 server and open it on a Windows 8 machine.
Another issue is that Esent is tied to the actual machine locale, and may require
a defrag stage when moving between machines with different locales.
Voron, on the other hand, was built to avoid all those issues, and you can move
it between machines with no problems. It also tends to be faster than Esent for
most purposes. Voron is optimized for 64 bits, and while it can run on 32 bit
systems, its database size is very limited in those scenarios. Voron also has a
lot less real world experience behind it than Esent.
The conservative choice would be to go with Esent for the time being, even
though Voron is what we are aiming at for the future. New features in RavenDB
(distributed counters, event storage, etc.) are coming down the pipe that will
be Voron only. And, of course, Hibernating Rhinos' own internal systems are
running on Voron.

3.4.1 Where is the data actually stored?


Regardless of the actual storage engine used, all the data is stored in a directory
specified when creating the database. If a directory isn't specified, we'll use the
following path: $DataPath\Databases\$DbName. The $DataPath is the default
location for all the databases; usually that is the server executable's location, but
you can configure it using the Raven/DataDir setting in the App.config file.
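For example, a sketch of that setting in the server's configuration file (the path here is just an example):

<appSettings>
  <!-- Example only: put all databases under D:\RavenData -->
  <add key="Raven/DataDir" value="D:\RavenData" />
</appSettings>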
For more details about what is going on behind the scenes, you can look at
Chapter 8, which talks about operations and production usage.


3.5 Running in Production


So far, we have been running RavenDB in debug mode, as a console application.
While this makes it very easy to run, it is obviously not a suitable way
to run RavenDB for production use. In production, we usually run RavenDB
as an IIS application or as a Windows Service.
We'll cover RavenDB in production in a lot more depth in Chapter
8, Operations.
For production use, you can go to the RavenDB Download Page and download
the installer package. You can see how it looks in Figure 3.4.

Figure 3.4: The RavenDB Installer


This installer gives you the option of selecting a Windows Service or IIS, and
takes care of all the details of installing RavenDB for you. Note that installing
RavenDB for production use requires a license. You can also install RavenDB
on your machine for development usage, which can be done without a license.
An unlicensed copy of RavenDB cannot use authentication, and will consider
any incoming request as one made by the system administrator.
There isn't much of a difference between running RavenDB as an IIS application
and as a Windows Service. When running in IIS, you can take advantage of
the additional monitoring capabilities that IIS provides. However, you are also
saddled with its limitations (request length and time limits, startup/shutdown
time quotas, etc.). When running as a Windows Service, RavenDB is responsible
for everything, including managing the HTTP layer.

Running as part of a Windows Cluster requires that you run as a Windows
Service. But other than that, the preference tends to be to run in IIS, for the
additional tooling and management support that it provides.

3.6 Summary
In this chapter, we talked about getting started with RavenDB. We installed
RavenDB from scratch, then talked to it using the document store and the document
session. We also explored the client API, what it can do and how to
best utilize it.
You should have a single document store instance in your application, and use
it to create a single session per request (or per message). We've also seen
some sample code to handle that for common scenarios such as ASP.NET Web
API and ASP.NET MVC. We covered how to do basic CRUD with RavenDB
using the session and explored the layered structure of RavenDB.
The session is the highest layer of the Client API; below it we have the database
commands, and finally the raw REST calls over HTTP. That layered architecture
is also present on the server side, one example being the storage engines
that we looked at, Esent and Voron.
We touched briefly on running RavenDB in production, just to get you started;
we'll talk about this a lot more in Chapter 8, Operations.
In the next chapter, we'll talk more about the various concepts inside RavenDB.
We'll see how everything is put together, and how you can best take advantage
of it.


Chapter 4

RavenDB concepts
We have a running instance of RavenDB, and we have already seen how we can
put information into and get it out of our database. But we are still only scratching
the surface of what we need to know to make effective use of RavenDB. In this
chapter, we'll go over the major concepts inside RavenDB.
The first step along the way is to understand what documents are.

4.1 Entities, Aggregate Roots and Documents


When using a relational database, you are used to working with Entities (hydrated
instances of rows from various tables) that make up a single Aggregate Root.
There is also the relatively minor concept of Value Objects, but those tend to be
underused, because you have to have a distinct identity for many things in your
domain that don't really require it. The classic example is the order line. It has
no independent existence, and it must be modified in concert with the whole
order. Yet, in a relational database, an order line must have its own primary
key, and it is entirely feasible to change an order line independently of the order it
is associated with.
We'll dedicate a chapter, Document based modeling, to a full discussion of
modeling behavior inside RavenDB, but here are the basics. In RavenDB, every
document is an Aggregate Root. In fact, we generally don't even bother calling
them Aggregate Roots, and just call them Entities. The distinction between an
Aggregate and an Entity is only there because of the limitations of relational
databases.
An Entity is a document, and it isn't limited to simple structures such as a
key/value map. You can model very complex structures inside a document.
In the order and order line case, we'll not model the order and order lines
independently. Instead, the order lines will be embedded inside the order. Thus,
whenever we want to load the order, we'll get all of the order lines with it.
And a modification to an order line (be it updating, removing or adding) is a
modification to the order as a whole, as it should be.
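As a rough sketch, based on the sample order document we saw in the studio, such a model might look like this (the classes here are illustrative, not the exact sample data classes):

using System.Collections.Generic;

public class Order
{
    public string Id { get; set; }
    public string Company { get; set; }

    // Embedded order lines: they live and die with their parent order.
    public List<OrderLine> Lines { get; set; }
}

public class OrderLine
{
    // Note: no Id property. An order line has no identity of its own.
    public string Product { get; set; }
    public string ProductName { get; set; }
    public decimal PricePerUnit { get; set; }
    public int Quantity { get; set; }
    public decimal Discount { get; set; }
}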
The order line is now a Value Type: an object that only has meaning within
its parent object, not independently. This has a lot of interesting implications.
You don't have to worry about Coarse Grained Locking¹ or partial entity updates.
Rules like external references should only be to aggregates are automatically
enforced, simply because documents are aggregates, and they are the only thing
you can reference.
Documents are independent and coherent. What do those mean? When
designing the document structure, you should strive toward creating a document
that can be understood in isolation. You should be able to perform operations
on a document by loading that single document and operating on it alone. It
is rare in RavenDB to need to reference additional documents during write
operations. That is enough modeling for now; we'll continue talking about it
in the next chapter. Now we are going to go beyond the single document scope,
and look at what a collection of documents is.

4.2 Collections
On the face of it, it is pretty easy to explain collections. See Figure 4.1 for a good
example.
It is tempting to think about collections as a set of documents that have the same
structure and are stored in the same location. That is not the case, however.
Two documents in the same collection can be utterly different from one another
in their internal structure. See Figure 4.2 for one such example.
Because RavenDB is schemaless, there is no issue with doing this, and the
database will accept and work with such documents with ease. This allows
RavenDB to handle dynamic and user generated content without any of the
hard work that is usually associated with such datasets. It is pretty common to
replace EAV² systems with RavenDB, because it makes such systems very easy
to build and use.
RavenDB stores all the documents in the same physical location, and
the collection association is actually just a different metadata value. The
Raven-Entity-Name metadata value controls which collection a particular
document will belong to. Being a metadata value, it is something that is fully
under your control.

¹. You might notice a lot of terms from the Domain Driven Design book used here; that is
quite intentional. When we created RavenDB, we intentionally made sure that DDD applications
would be a natural use case for RavenDB.
². Entity-Attribute-Value schemas, the common way to handle dynamic data in relational
databases. Also notorious for being hard to use, very expensive to query and in general a
trouble area you don't want to go into.

Figure 4.1: The collections in the Northwind database

Figure 4.2: Two differently structured documents in the Users collection
Collections & document identifiers
It is common to have the collection name as part of the document
id. So a document in the Products collection will have an id like
products/1. That is just a convention; you can have a document
in the Products collection (because its metadata has the
Raven-Entity-Name value set to Products) while it has the id
bluebell/butterfly.
RavenDB does use the collection information to optimize internal operations.
Changing the collection once the document has been created is not supported. If you
need to do that, you'll need to delete the document and create it with the same id,
but a different collection.
We've talked about the collection value in the metadata, but we haven't actually
talked about what the metadata is. Let's talk meta.

4.3 Metadata
The document data is composed of whatever it is that you're storing in the
document. For the order document, that would be the shipping details, the
order lines, who the customer is, the order priority, etc. But you also need a
place to store additional information; not related to the document itself, but
about the document. This is where the metadata comes into play.
The metadata is also in JSON format, just like the document data itself. However,
there are some limitations. The property names follow the HTTP headers
convention of being Pascal-Cased. In other words, we separate words with a
dash, and the first letter of each word is capitalized; everything else is in lower
case. This is enforced by RavenDB.
RavenDB uses the metadata to store several pieces of information about the
document that it keeps track of:

The collection name - stored in the Raven-Entity-Name metadata property.
The last modified date - stored in the Last-Modified metadata property³.
The client side type - stored in the Raven-Clr-Type metadata property.
The etag - stored in the @etag metadata property, and discussed at length
later in this chapter.

³. This is actually stored twice, once as Last-Modified and once as Raven-Last-Modified.
The first follows the RFC 2616 format and is only accurate to the second; the second is
accurate to the millisecond.


You can use the metadata to store your own values. For example, Last-Modified-By
is a common metadata property that is added when you want to track who
changed a document. From the client side, you can access the document
metadata using the following code:
Product product = session.Load<Product>("products/1");
RavenJObject metadata = session.Advanced.GetMetadataFor(product);
metadata["Last-Modified-By"] = currentUser.Name;
It is important to note that there will be no extra call to the database to fetch
the metadata. Whenever you load the document, the metadata is fetched as
well. In fact, we usually need the metadata to materialize the document into
an entity.
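Reading values out of the metadata works the same way; a quick sketch (note the dashed, Pascal-Cased name):

// Retrieve a built-in metadata value from the same RavenJObject.
var collectionName = metadata.Value<string>("Raven-Entity-Name");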
Changing a document collection
RavenDB does not support changing collections. While it is possible
to change the metadata value for Raven-Entity-Name, doing so is
going to cause issues.
We have a lot of optimizations internally that avoid extra work based
on the collection name, and no support whatsoever for changing it.
We've tried to add support for this, or even to just flat out error when
/ if you try to change the collection name, but either option proved
to be too expensive.
When you save a document, RavenDB can just throw the data onto
the disk as fast as possible; needing to check for a previous collection
name has proven to be very expensive (needing to do a read per
write) and hurt our performance, especially in bulk insert mode.
If you need to change a document's collection, the supported way to
do that is to delete it and then save it again, with the same document
id.
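In code, that two-step move might look like the following sketch (the Person class and the ids are hypothetical):

using (var session = documentStore.OpenSession())
{
    session.Delete("users/1");
    session.SaveChanges();
}
using (var session = documentStore.OpenSession())
{
    // Same id, different entity type, so the document lands in a new collection.
    session.Store(new Person { Name = "John" }, "users/1");
    session.SaveChanges();
}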
Once you have the metadata, you can modify it as you wish, as seen in the last
line of the earlier code. The session tracks changes to both the document and its
metadata, and changes to either of them will cause the document to be updated on
the server once SaveChanges has been called.
Modifying the metadata in this fashion is possible, but it is pretty rare to do
so explicitly in your code. Instead, you'll usually use listeners to do this sort of
work.

4.4 Document Identifiers


A document id in RavenDB is how we identify a single document from all the
rest. They are the moral equivalent of the primary key in a relational system.
Unlike a primary key, which is unique per table, all the documents in a database
share the same key space⁴.

⁴. Remember, collections are a virtual concept.
Identifiers terminology
Document identifiers are also called document keys, or just ids or
keys. In the nomenclature of RavenDB, we use both keys and ids to
refer to a document id.

4.4.1 Don't use Guids


Therefore, it follows that one of the chief requirements of document ids is
that they be unique. This turns out to be a not so trivial problem to
solve. A simple way to handle it is to use a Guid, such as this one:

92260D13-A032-4BCC-9D18-10749898AE1C
It is entirely possible to use Guids as document identifiers in RavenDB. But it
is also possible to drive to work on a unicycle. Possible doesn't mean advisable.
Guids are used because they are easy to generate, but they suffer from weaknesses
when it comes to their use as unique identifiers: they are relatively big
compared to other methods, and they are nonsequential.
Those two mean that it is easy to get a database into a situation where it has to
do a lot more work just to get data in when you are using a Guid. But that isn't
their chief problem. The problem is that they are utterly opaque to humans.
We often use identifiers for many purposes; debugging and troubleshooting are
not the least of those.
And having to look at 92260D13-A032-4BBC-9D18-10749898AE1C and see
what we did with it along the way is not a really good way to spend your time.
If you ever had to read a Guid over the phone, or keep track of multiple Guids
in a log file, or just didn't realize that the Guid in this paragraph and the Guid
higher up the page aren't in fact the same Guid...
Guids aren't good for us. And by us, I mean humans.

4.4.2 Human readable identifiers


A much better alternative is the default approach used by RavenDB: using the
collection name as a prefix, with a numeric id to distinguish different documents.
You've already seen examples of this default approach. We have products/1,
orders/15, etc.
This approach has several advantages. It tends to generate small and sequential
keys, and most importantly, these types of keys are human readable and easily
understood.
The question now is: how do we get this numeric suffix?

4.4.3 High/low algorithm


The problem with generating unique values is that you might not be the only
one who wants to generate them at this particular moment in time. So we have
to ensure that we don't get duplicates.
One way to do that is to use a single source for id generation, which will be
responsible for never handing out a duplicate value. RavenDB supports that
option, and you can read about it in the next section, on Identity. However, such
an approach requires going to the same source each and every time we need
to generate an identifier.
The default approach used by RavenDB is quite different. We use a set of
documents called the hilo documents. Here is a list of those documents in the
Northwind database:

Raven/Hilo/categories
Raven/Hilo/companies
Raven/Hilo/employees
Raven/Hilo/orders
Raven/Hilo/products
Raven/Hilo/regions
Raven/Hilo/shippers
Raven/Hilo/suppliers

Those are pretty trivial documents; they all have just a single property, Max.
That property's value is the maximum possible number that has been generated
(or will be generated) for that collection. When we need to generate a new
identifier for a particular collection, we fetch that document and get the current
max value. We then add to that max value and update the document.
We now have a range, between the old max value and the updated max value.
Within this range, we are free to generate identier with the assurance that no
one else can generate such an identier as well.
The benet of this approach is that this also generates roughly sequential keys,
even in the presence of multiple clients generating identiers concurrently.


4.4.3.1 Self-optimizing


The basis of the hilo algorithm is that a client that needs to generate 10 ids
can take a range of 10 when it communicates with the server. From then on, it
can generate those ids independently from the server until it runs out of them.
Of course, you usually don't know upfront how many ids you'll want to generate,
so you guess. By default, we use 32 as the initial range size, but the nice thing
about the hilo approach is that it is the client that controls how much to take.
Remember how we said that RavenDB is self-optimizing? Here is one such case.
When the client runs out of the reserved id range, it has to go back to the server
to get a new reserved range. When it does so, it checks to see how much time
has passed since the last time it had to go to the server. If the time is too short,
that is an indication that we are burning through a lot of ids. In order to reduce
the number of remote calls, the client will then request a range twice as big as
before.
In other words, we start by requesting a range of 32. We consume that quickly,
so we request a range of 64, and so on. Very quickly, we find the balance where
we don't have to go to the server too often to get new ids.
The actual mechanics are a bit more complex, because in practice we scale up
by more than just a power of two. We also have the ability to reduce the size
of the range we request if we aren't going through the range fast enough.
The actual details of how this works are not part of the algorithm; those are internal
implementation optimization details. But it is important to understand the
benefits that you get when using this.
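To make the range-sizing behavior concrete, here is a minimal client-side sketch.
Every name in it (and the one-second threshold) is hypothetical; the real client
is more involved and also has to be thread safe:

public class HiLoRangeSketch
{
    private long current;      // last id we handed out
    private long max;          // top of the reserved range
    private int capacity = 32; // initial range size, doubled under load
    private DateTime lastRequest = DateTime.MinValue;

    public long NextId(Func<int, long> reserveRangeOnServer)
    {
        if (current >= max) // ran out of reserved ids
        {
            // Coming back too quickly means we are burning through ids;
            // request a bigger range to reduce remote calls.
            if ((DateTime.UtcNow - lastRequest) < TimeSpan.FromSeconds(1))
                capacity *= 2;
            lastRequest = DateTime.UtcNow;

            // Hypothetical server call: atomically raise Max by 'capacity'
            // and return the new Max. We now own (max - capacity, max].
            max = reserveRangeOnServer(capacity);
            current = max - capacity;
        }
        return ++current;
    }
}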
4.4.3.2 Concurrency
So far, we have talked about hilo as if there was just a single client talking to the
server. The question is: what happens when two clients request a range at the
same time?
Requesting a hilo range
While the terminology we use is "requesting a range", the RavenDB
server isn't actually aware of the hilo protocol in any meaningful way.
Requesting a range is a process that involves loading the document,
updating the Max value and saving it back.
Aside from those basic operations, all the behavior for hilo is implemented client side.
As you'll see toward the end of the chapter, RavenDB supports optimistic concurrency,
and the hilo protocol takes advantage of it. When we save the
document back after raising the Max value, we do so in a way that will throw
a ConcurrencyException if the document has been changed in the meantime. If
we get this error, we retry the entire process from the beginning: fetching the
document again, recording the current Max value, then saving with the new
Max.
This way, we are protected against multiple clients overwriting one another's
changes and generating duplicate ids.
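The retry loop can be sketched as follows. HiloDoc, LoadHiloDocument and
SaveWithConcurrencyCheck are hypothetical stand-ins for the client's low-level
document operations; only the retry-on-ConcurrencyException shape comes from
the description above:

long ReserveRange(int size)
{
    while (true)
    {
        // Hypothetical: load the hilo document along with its etag.
        HiloDoc doc = LoadHiloDocument("Raven/Hilo/orders");
        try
        {
            // The save succeeds only if the document still carries the etag
            // we loaded it with; otherwise another client got there first.
            SaveWithConcurrencyCheck(doc.Id, doc.Max + size, doc.Etag);
            return doc.Max + size; // we now own (doc.Max, doc.Max + size]
        }
        catch (ConcurrencyException)
        {
            // Another client reserved a range concurrently; retry from scratch.
        }
    }
}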
4.4.3.3 Distributed hilo
Using the hilo algorithm, we only have to go back to the database once we run
out of ids in the range we reserved. But what happens when we cannot contact
that database? We'll touch on the distribution model later on, in Part 3, Scale Out,
but I do want to expand on how this relates to hilo now.
Assume that we have a RavenDB cluster made of 3 nodes. We will configure
each node to have its own unique hilo prefix. This way, if the primary node is
down, we can still reserve ranges, and we don't have to worry about reserving
the same range as another client because of a network failover.
We'll discuss such scenarios extensively in Part 3. For now, all you really care
about is that you can use the hilo system in a cluster without worrying about
a single point of failure.
4.4.3.4 Manual hilo
You don't need to do anything special to use the hilo algorithm; it is what the
RavenDB Client API does by default. It generates ids that have the following
format:5

orders/13823
products/7371
PackageTracking/38248225

5. When the entity name is composed of a single word, we'll default to lower casing it;
when it is composed of multiple words, we'll preserve the casing.

But sometimes you want to just have the numeric id and work with that. Maybe
you are working with internal ids (see Chapter 5, Modeling) or using semantic
ids (see a bit later in this chapter), but for whatever reason, you want to be able
to generate those hilo values yourself.
You don't need to implement everything from scratch. You can just
write the following code:


HiLoKeyGenerator hiloKeyGenerator = new HiLoKeyGenerator("tags", 8);

long id = hiloKeyGenerator.NextId(documentStore.DatabaseCommands);
The hiloKeyGenerator instance should be a singleton; do not try to create a new
one whenever you need an id, since that would require us to reserve a new range
every time. The HiLoKeyGenerator constructor accepts the hilo name and the
initial size of the range. The actual size of the range will change according to
your actual usage.
Most of the time, when we call NextId, we never have to go to the server;
we can just increment the internal value, and as long as we are in range, we are
good. This is effectively what the RavenDB Client API does for you, except that
it usually defaults to a range of 32.
You can use the generated id to do quite a lot of good, and you benefit from
not having to go to the server all the time.

4.4.3.5 The downside of hilo


The hilo algorithm is the default approach for generating ids in RavenDB for a
reason. It is simple, efficient and scalable, and it generates very nice identifiers.
Is there a downside? Of course there is, if only because someone ate my free
lunch.
The one downside of hilo is that it can generate non-consecutive ids. What do
I mean by that? Let us assume that we need to save several new orders. Using
hilo, we generate the following ids for them: orders/65, orders/66, orders/67.
Then we restart the application, and save some more orders.
Because hilo reserves a range, and there is no way to unreserve part of that range,
the rest of the range has been lost. In this case, we reserved the range
65 - 96, but after generating just 3 ids, we were restarted. The remainder of the
range has been lost, and now we'll generate ids such as orders/97, orders/98, etc.
In practice, that isn't really a big downside. You might lose a few ids if the
application restarted in production, but you'll likely not notice that. It is most
often in development, where it is common to restart the application frequently,
that people notice and wonder about this behavior.
But the actual reason that this isn't a big issue is that the ids the RavenDB
Client API generates aren't meaningful on their own. It doesn't actually matter
if an order document's id is orders/58 or orders/61. So skipping ids isn't
something that we generally need to concern ourselves with. When we do,
we have the identity option.


4.4.4 Identity
If you really need consecutive ids, you can use the identity option.
Identity, just like in a relational database (where it is sometimes called a sequence),
is a simple always-incrementing value. Unlike the hilo option, you always have to go to the
server to generate such a value.
There are two ways to generate an identity value. The first is to do so implicitly,
as shown in Listing 4.1.
Listing 4.1: Using implicit identity
using (var session = documentStore.OpenSession())
{
    var product = new Product
    {
        Id = "products/",
        Name = "What's my id?"
    };
    session.Store(product);
    Console.WriteLine(product.Id);
    // output: products/
    session.SaveChanges();
    Console.WriteLine(product.Id);
    // output: products/78
}
You can see that we actually define an id for the product. But a document
id that ends with a slash (/) isn't allowed in RavenDB. We treat such an id
as an indication that we need to generate an identity. That has an interesting
implication.
How are identities stored?
Identities are stored6 as a list of tuples containing the identity key
and the last value. This data is persistent. In other words, if you
delete the latest document, you won't get the same id back.
As a result of that, identities are actually created lazily, the first
time we need them. The first value generated is always one, and
there is no way to set a different step size for identities. This raises
the question: what happens if we create a document with the id
products/1 manually, then try to save a document with the id
products/?
RavenDB is smart enough to recognize this scenario, and it will
generate a non-colliding id in an efficient manner. In this case, we'll
get the id products/2.
6. This is a conceptual description; the actual storage is quite different.

We don't go to the server until we actually call SaveChanges. That means
that we don't know what the actual document id is until after we have gotten
the reply from the server. That isn't fun, but on the other hand, we can save
multiple documents using identity without having to go to the server for each
of them individually.
The other way to use identities is to do so explicitly. You can do that using the
following code:

long nextIdentity = documentStore.DatabaseCommands
    .NextIdentityFor("invoices");

This allows you to construct the full document id on the client side. But it does
require two trips to the database, one to fetch the identity value and a second
to actually save the document. There is no way to get multiple identity values in a single
request.
You can set the identity's next value using this command:

long nextIdentity = documentStore.DatabaseCommands
    .SeedIdentityFor("invoices", 654);
Invoices, and other tax annoyances
For the most part, unless you are using semantic ids (covered later in
this chapter), you shouldn't care what your document id is. The one
case where you do care is when you have an outside requirement to generate
absolutely consecutive ids. One such common case is when you need
to generate invoices.
Most tax authorities have rules about not missing invoice numbers,
to make it just a tad easier to actually audit your system. But an
invoice document's identifier and the invoice number are two very
different things.
It is entirely possible to have the document id of invoices/843 for
invoice number 523.
4.4.4.1 The downsides of identity
There is no such thing as a free lunch, and identity also has its own set of drawbacks.
Chief among them is that identities are actually not stored as documents.
Instead, they are stored internally in a way that isn't quite so friendly.
That means that exporting and importing the database will not also carry over
the identity values. The identity values are also not replicated, so identity
isn't suitable for use in a cluster.
Finally, modifying an identity happens in a separate transaction from the current
transaction. In other words, if we try to save a document with the id
products/, and the transaction fails, the identity value is still incremented.
So even though identity generates consecutive numbers, it might still skip ids if
a transaction has been rolled back.
Except for very specific requirements, such as an actual legal obligation to generate
consecutive numbers, I would strongly recommend not using identity. Note
my wording here: a legal obligation doesn't arise because someone wants consecutive
ids because they are easier to grasp. Identity has a real cost associated
with it.

4.4.5 Semantic ids


Document ids in RavenDB do not actually have to follow the products/43
format. A document id can be any string up to 1,024 Unicode characters7, so
you have a lot more options here.
One common scenario where you want to generate your own semantic ids is
when you want to ensure that something is unique. Let us say that we wanted
to make sure that we had unique user names. We can do that by naming the
users' documents with the actual user names:

users/ayende
users/john83
users/zebrrra

This does several things at once: it ensures that there can never be
a duplicate user name, and it allows us to load the document easily given just
the user name. What if we don't have a username in our system, but just use
the email8? We use the same approach:

users/ayende@ayende.com
users/john83@example.org
users/zebrrra@endofworld.left

Another reason to want semantic ids is to generate ids such as
customers/483/transactions/2014-08-06. As you can probably tell, this document
is for Customer #483 and it contains all the transactions for Aug 6, 2014.
Semantic ids are important in the context of modeling, and are discussed in
Chapter 5, Modeling.
Generating a semantic id is just a matter of setting the Id property of the
document before calling Store.
7. That doesn't mean that it is a good idea to have a very long document id; long document
ids require us to allocate more resources and do a lot more work internally. Ideal document
ids are pretty short.
8. What happens when you want to have both username and email unique? That is where
the RavenDB Unique Constraint Bundle comes to the rescue. This is discussed in Chapter
11, Bundles.
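To make the last point concrete, here is a minimal sketch of assigning a semantic
id. The User class is a stand-in for your own entity (any entity with a string Id
property works), and the uniqueness note in the comment assumes optimistic
concurrency is turned on, as discussed later in this chapter:

var user = new User
{
    // The semantic id, assigned before Store is called.
    Id = "users/ayende",
    Name = "Oren Eini"
};
session.Store(user);
// With session.Advanced.UseOptimisticConcurrency = true, saving a *new*
// entity under an id that already exists throws a ConcurrencyException,
// which is what turns semantic ids into a uniqueness check.
session.SaveChanges();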


4.4.6 Working with document ids


By now, you have a pretty good idea about document ids and how they work.
But that was almost entirely a discussion of how the client and the server
generate ids. We haven't talked about actually working with them.
RavenDB uses the documentStore.Conventions.FindIdentityProperty convention
to figure out where the document id is stored on your entities. By default, that is
a property (or field) named Id (case sensitive). We have already talked about
how to customize that in the previous chapter.
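For completeness, here is what pointing that convention at a differently named
property might look like; the delegate shape (a predicate over the candidate
member) is my recollection of the 3.0 client and should be treated as an
assumption:

// Assumed delegate shape: a predicate that picks the id member.
documentStore.Conventions.FindIdentityProperty =
    member => member.Name == "DocId"; // use DocId instead of Id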
You probably realized it already, but it is important to mention explicitly:
identifiers in RavenDB are strings. That surprises people coming from a relational
background, where they are used to ids just being integers (or evil Guids).
Avoid public int Id { get; set; }
If you define your Id property as an integer (or long, or Guid), everything
will work. However, under the covers, the document id is still
a string. What happens is that the RavenDB Client API will
take your products/328 document id, strip off the first part
(because the convention says that it is the collection name) and stick
the numeric part in the id.
This matches what a lot of people are familiar with from working
with relational databases, and this feature is provided solely to make
it easier to migrate to RavenDB. The problem with this approach is
that the id is still a string, and there is a limit to how well we can
pretend that it is an integer.
This usually comes up in the context of indexes or transformers,
because those run on the server side, and we can't fake it out there.
Now you have a disconnect in your model: your client
side code thinks the id is an integer, and on the server side it is a string.
Since indexes and transformers are actually defined on the client
side, but executed server side, you can see how that would cause
issues.
Given a document id, the most efficient way to get the document is to Load
it. Because it is a pretty common mistake to try to Query for a document id
(which is several times more expensive), such an attempt is blocked and will
throw.9
A document id cannot be changed once created. An attempt to associate two
entity instances with the same document id, or to change the document
id on an entity once it was loaded or stored, will result in an exception.
9. You can set the convention option AllowQueriesOnId to allow that if you really require
this.


And that is quite enough about document identifiers. We'll now move to the
other crucial piece of information that every document has: the ETag.

4.5 ETags
An etag in RavenDB is a 128 bit number that is associated with a document.
Whenever a document is created or updated, a new etag is assigned to it.
Etags are always incrementing, and they are heavily used inside
RavenDB. Among their usages:

Ensuring cache consistency
Optimistic concurrency
Indexing
RavenDB replication
Relational database replication
Incremental exports

An etag is associated with the document metadata, so whenever you load the
document, you also have the etag available. Retrieving the etag is easy;
all you have to do is:

Etag productEtag = session.Advanced.GetEtagFor(product);

On the client side, etags are used for optimistic concurrency control and for
caching. We'll touch on optimistic concurrency in the next section, and caching
in the section after that. First, I want to focus on how we use etags on the server
side.
The structure of an ETag
There is just one promise that we make about etags, and that
promise is that they are always incrementing. Anything else
is an implementation detail. That said, it can be an interesting
implementation detail. Let us take a look at an etag:
01000000000000010000000000000EB6.
This looks like a Guid, and indeed, this is a 128 bit number, which
is using the Guid format convention because it is convenient. It
is actually composed of the following parts:
(1) 01
(2) 000000-0000-0001
(3) 0000-000000000EB6
The first part is the etag type. 01 is the etag type for a document; this is
the most common etag you'll run into. The second is the number of
database restarts; this value is incremented by one every time the
server restarts. In the case of our etag, it was generated on the first
run since the database was created.
The last part is the number of changes that happened during the
current database restart period. All etags inside the same database restart
period are consecutive.
There isn't really much cause for you to care about the actual content
of an etag, but the question is raised often enough. Just remember
that the details are implementation details, and might change in the
future.

Because an etag is assigned to a document on every put10, and because etags
are always incrementing, it is possible for RavenDB to iterate over documents in
their update order. So if I save products/1, products/2 and products/3 and then
update products/1, when I iterate over them using the etag, I'll get results in
the following order: products/2, products/3, products/1.
In fact, in the studio, when you are looking at all the documents, we default
to sorting the documents by their last update. We do that by iterating over
the documents in reverse update order, using the etags. But beyond the studio,
iterating over the documents in update order turns out to be quite useful.
That is how RavenDB implements indexing and replication, among other features.
The way it works is quite simple. We start from the empty etag
(00000000000000000000000000000000) and ask the storage engine to give
us all the documents that have an etag greater than that etag. RavenDB gives
us a batch of documents to process. After we process those documents, we take
the last document's etag and remember it. Then we go back to the storage engine
and ask it to give us all the documents after that etag. Rinse, repeat, and you have
processed all the documents.
ETags in distributed systems
An etag is only consistent within the node that defined it. RavenDB
ensures that an etag is always incrementing within the node, but
there is no coordination over etags in the cluster as a whole.

The reason this works is that when a document is updated midway through
this operation, we will see it again (because its etag was changed to a higher
value). The actual behavior we have for indexing or replication is quite a bit
more complex, but this is the basis for it.
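Here is a minimal sketch of that loop. GetDocumentsAfter and Process are
hypothetical stand-ins for the storage engine call and the per-document work;
only the shape of the iteration comes from the description above:

// Iterating all documents in update order by etag (sketch).
Etag lastProcessed = Etag.Empty; // the all-zeros etag
while (true)
{
    // Hypothetical stand-in for the storage engine call described above.
    var batch = GetDocumentsAfter(lastProcessed, batchSize: 1024);
    if (batch.Count == 0)
        break; // caught up, nothing newer than lastProcessed

    foreach (var doc in batch)
        Process(doc); // index it, replicate it, etc.

    // An updated document gets a higher etag, so if it changes mid-run
    // we will simply see it again in a later batch.
    lastProcessed = batch[batch.Count - 1].Etag;
}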
Now, let us see what other usages we have for etags.
10. RavenDB doesn't make a distinction between a create and an update.


4.6 Optimistic Concurrency Control


What happens when two users try to modify the same document
at the same time? A document is a unit of change, and as such, trying to modify
it concurrently is not allowed, so a ConcurrencyException will be thrown. But
actually managing to save the same document at the same instant is
pretty rare in most systems. Usually you are a lot more worried about the
following scenario:

09:01 AM - John loads orders/3
09:02 AM - Martha loads orders/3
09:03 AM - John modifies orders/3 and saves it
09:04 AM - Martha modifies orders/3 and saves it

Note that at no point did we have any concurrency; each action happens at a
different time. Because they happened at different times, RavenDB can't tell
that there is any issue, and Martha's changes will overwrite John's changes.
This behavior is called Last Write Wins. It is pretty useful when we
don't have contention on our documents.
In many cases, we do want to detect and handle this scenario. You can ask
RavenDB to do just that using the following code:
session.Advanced.UseOptimisticConcurrency = true;
Once this is turned on, the session will make sure that when it sends a document
to be saved to RavenDB, it includes the original etag the document was fetched
with. When the server actually gets the document to be saved, it compares
the specified etag to the current one, and it throws a ConcurrencyException
if they do not match. The server rolls back the current transaction and sends
the error to the client.
Why is Last Write Wins the default?
Because this is what customers asked for, very loudly. Now, sometimes
we get to tell a customer that as much as he wants a specific
feature, it isn't something that is going to happen, because the
ramifications of the feature would be destructive in the long run. Why
didn't we do the same here? How is this Safe by Default?
Well, Safe by Default doesn't mean that we need to protect you
by placing you in a padded room in a straitjacket. And there are
several very common scenarios that make having Last Write Wins
compelling. In particular, getting an entity through form binding
and saving it to RavenDB directly, with very little code. Consider
the following ASP.Net code:



public ActionResult Edit(Product product)
{
    DocumentSession.Store(product);
}
Here, we are storing the object that we got directly from the ASP.Net
model binding, without needing to first load the object. This is a
very easy way11 to handle data access.
But because we haven't loaded the document, we don't know
its etag, so we would have to assume it is a new one. Except that it already
exists, so an error would be thrown, and users would be upset over
breaking this common scenario.
There are other, similar scenarios, so we made the call to default to
Last Write Wins, and let you turn on optimistic concurrency when you
need it.

The UseOptimisticConcurrency setting affects all the operations in the session12.
What happens if we want to have optimistic concurrency for only some documents?
You can force the RavenDB Client API to do an optimistic concurrency check
on a single document using the following code:
using (var session = documentStore.OpenSession())
{
    var p1 = session.Load<Product>("products/1");
    var etag = session.Advanced.GetEtagFor(p1);
    session.Store(p1, etag);
    var p2 = session.Load<Product>("products/2");
    p1.QuantityPerUnit += 10;
    p2.Discontinued = true;
    session.SaveChanges();
}
We aren't setting UseOptimisticConcurrency here, so we default to Last Write
Wins, and indeed, this is what will happen if someone makes a change to the
products/2 document. But because we explicitly called Store(p1, etag), the
session will save any updates to p1 with optimistic concurrency13. There aren't
many scenarios where you actually want to have Last Write Wins for one particular
document and optimistic concurrency for another. This feature was created to
enable end-to-end optimistic concurrency.
11. Note that this has its pitfalls, such as lack of proper validation or business logic, but for
some CRUD scenarios, this is perfect.
12. That means both PUT and DELETE operations that are generated from
SaveChanges. Optimistic concurrency is also available when patching documents, and will
be discussed as part of Chapter 6.
13. Note that if the document was not changed in the session, calling SaveChanges will not
throw even if the document was modified server side.

4.6.1 End-to-end optimistic concurrency


Just setting UseOptimisticConcurrency isn't going to be enough in most systems.
UseOptimisticConcurrency is relevant when all of the changes are done within
the same session. However, in most situations, you are going to load a document,
send it to the user, and then get a request a few seconds or minutes later that tells
us to do something to that document:
1> Load document products/2 and send it to John
2> Load document products/2 and send it to Martha
3> Get update request for products/2 from John
4> Load the document in a new session, update it per John's instruction and save it
5> Get update request for products/2 from Martha
6> Load the document in a new session, update it per Martha's instruction and save it
7> John's updates are lost

The problem is in how we define what changed. In the scenario above, optimistic
concurrency is within the scope of a single session. In Martha's update case, we
loaded the document (which already had John's update) and then saved it with
Martha's update. There is no problem as far as the database is concerned. The
problem is that Martha never saw John's update, and as far as John & Martha
are concerned, this is just another case of Last Write Wins.
In order to handle this scenario from an end-to-end perspective, we need to
send the client the etag of the document along with the document
itself. And when we get a request to update the document, we need the original
etag value to be passed back. Listing 4.2 shows the server side code for end-to-end
concurrency using ASP.Net MVC.
Listing 4.2: End to end optimistic concurrency in RavenDB
public ActionResult LoadProduct(int id)
{
    Product product = DocumentSession.Load<Product>(id);
    Etag etag = DocumentSession.Advanced.GetEtagFor(product);
    return Json(new
    {
        Document = product,
        Etag = etag.ToString()
    });
}

public ActionResult EditProduct(Product product, string etag)
{
    Etag originalEtag = Etag.Parse(etag);
    DocumentSession.Store(product, originalEtag);
    return Json(new { Saved = true });
}
We send the etag along with the document, and we get the original etag
back with the updated document. If the document has been changed between
the LoadProduct and EditProduct requests, we'll detect it and can show an error
to the user.
By now we've seen that etags are important for indexing, replication and optimistic
concurrency. There is another area where etags play a big role: caching. Let us
see how that works in RavenDB.

4.7 Caching
Caching is the very first tool we reach for when we want to improve a system's
performance. In the context of a database, there is actually a lot of caching
involved. We precompute things and store them on disk for later14, we
cache indexes and documents in memory so we won't have to go to disk, and
we also handle caching on the client side.
You don't actually care about the server side caching; that is an implementation
detail. You certainly benefit from it, but it has no impact on your day to day
operations. The client side, however, is very relevant. The first cache you'll
encounter in RavenDB is the session cache:

var p1 = DocumentSession.Query<Product>()
    .Where(p => p.Name == "Chef Anton's Gumbo Mix")
    .First(); // returns products/5
var p2 = DocumentSession.Load<Product>("products/5");
Here, we are only going to go to the database once. We first query the database
for a product with the specified name, then we load a document by id.
Since that document was already loaded into the session by the query, we can
skip going to the server entirely. This is a very short lived cache, and it isn't
actually there for performance. The main reason we have this behavior is so the
session will implement the Identity Map pattern. Because the lifespan of the
session is very short, you don't get a lot of utility out of such a cache, but features
such as include help make it a very important optimization technique.
14. This is the essence of how RavenDB's Map/Reduce works, for example.


Usually this is where you start using a cache provider to cache the results of
queries, so you can avoid making a remote call to the database in the first place.
However, RavenDB is a full service database, and we see no reason why you
should have to write caching code.

Hand rolled / ad hoc caching
Caching should be a pretty simple process:
Check the cache; if the value is there, return it.
Otherwise, load it, put it in the cache and return it.
For something that is supposed to be so simple, it is actually really
complex once you get into the details15. If the objects that you hold
in the cache are mutable, then you can't actually return the same
instance from the cache in multiple calls; you open yourself up to
race conditions by having two threads fetch the same instance and
modify it concurrently.
Cache misses are also pretty complex. What happens if two
threads have a cache miss at the same time on the same item?
Do both of them fetch the value (increasing load on the database)
or just one of them? Cache invalidation is a topic that requires you
to juggle multiple competing concerns (liveliness, performance, data
staleness, and more).
Because of all those factors, caching code has the following properties:

Boring
Multi-threaded
Repeatable
Performance critical
Quite tricky to get right

Combine all those properties together and you can probably see why
writing caching code isn't something you want to do.
RavenDB exposes a REST interface to the outside world, and the nice thing
about REST and HTTP is that a lot of work has already been put into thinking
about caching. RavenDB takes full advantage of all that work, and thus the
RavenDB HTTP cache was born.
15. The devil is there, and it will poke you with a nasty pitchfork.


4.7.1 HTTP Caching


The HTTP specification has a lot to say about caching, most of it dealing
with how to ensure that caches are correct. Inside RavenDB, we use
the ETag and If-None-Match HTTP headers to create a very efficient caching
system. Let us see how it works.
You want to load a document (by calling DocumentSession.Load<Product>(1)),
which goes through the session layer, to the database commands and finally to the
REST layer. At that point, the call is converted into an HTTP request to a URI
similar to this: http://rvndb05:8080/databases/Northwind/docs?id=products/1.
Note that the generated URI is very important; it becomes the cache key that
we can check. In this case, this is the first HTTP request for that URI, so
we have nothing in the cache for it. We make the request, and we get a
reply back. Remember the etags we talked about in the previous section? They
come into play here as well, because as part of the reply, we get the etag of the
document.
Now we can put the HTTP response into the cache, under this URI as the key.
The HTTP cache's scope is the document store, not the session. And since a
DocumentStore instance is usually a singleton, that means that we have a single
cache for all our operations. The next time we make a request to this URI,
we'll have the result in the cache, so what happens then?
At that point, we have a cache hit, and we could just immediately return the
result, but that wouldn't be safe. What if the document has changed
in the database? We would return an outdated document. That would never
work. Instead, we make a request to the server, even though we
already have the previous response cached, but we send that request with an
If-None-Match header set to the previous request's etag.
On the server side, we check whether you have sent the If-None-Match header, and
can check whether the document in question has changed (by just comparing the etag
from the client with the current etag). If the etag is the same, we can just return
a 304 Not Modified response. This approach doesn't save us from having to make
a network call, but it does save us from needing to transfer a lot of information
over the wire when we already have it cached in memory.
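You can observe this handshake with any HTTP client. The sketch below is
purely illustrative of the If-None-Match / 304 mechanism (the URL format is
taken from the example above; the RavenDB Client API does all of this for you):

// Requires System.Net and System.Net.Http, inside an async method.
var client = new HttpClient();
var url = "http://localhost:8080/databases/Northwind/docs?id=products/1";

var first = await client.GetAsync(url); // 200 OK: full body plus an ETag
string etag = first.Headers.ETag == null ? null : first.Headers.ETag.Tag;

var request = new HttpRequestMessage(HttpMethod.Get, url);
request.Headers.TryAddWithoutValidation("If-None-Match", etag);
var second = await client.SendAsync(request);

// 304 Not Modified: nothing re-sent, reuse the locally cached body.
bool useCachedCopy = second.StatusCode == HttpStatusCode.NotModified;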
I've been talking about document loading, but the same process of using etags
and the If-None-Match header to check whether the response has changed is used
throughout RavenDB: when you are making a query, doing a multi load, using
includes or fetching suggestions. The requests for all of those have an ETag
header set16, and they have optimized code paths that can answer whether
or not the request's etag has changed, specifically so that the common case of "no,
whatever you have in the cache is fine" will be very fast.
16. This is easy in the case of a single document: just use the document's own etag. But what
happens if this is a single request for multiple documents; what would the request's etag be?
The answer is that we hash all the documents' etags together, and use the result as the request
etag.
Customizing the cache
It isn't common to need to modify the RavenDB HTTP cache, but
we don't believe in utterly blocking our users, so there is actually
quite a lot of control that you can assert over this process.
The cache is only active for GET requests; all other requests are
ignored. By default we cache all of them. A cache that doesn't
have an eviction policy isn't a cache, it is a memory leak, so the
RavenDB cache uses the Least Recently Used algorithm,
with a cap of 2,048 cached requests. You can change that by setting
the documentStore.MaxNumberOfCachedRequests property.
A cache trades off memory for time, so the higher the number of
cached requests, the more memory the cache will use, and the more
your application can serve from the cache. One scenario where
this is problematic is when you have many requests that return a large
number of big documents. The result is that the cache is filled with
a lot of data, and that might cause issues.
You can fine tune what goes into the cache by using the
documentStore.Conventions.ShouldCacheRequest event.
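A minimal tuning sketch. The property name comes from the text above, but
the exact shape of ShouldCacheRequest (a predicate over the request URL) is
an assumption for illustration:

documentStore.MaxNumberOfCachedRequests = 4096; // default is 2,048

// Assumed shape: a predicate over the request URL.
documentStore.Conventions.ShouldCacheRequest =
    url => url.Contains("hugeReportQuery") == false; // skip the big ones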
When you make a request with If-None-Match and the information has
changed, we just process the request normally, and send the results to you
along with the new etag. The end result is a system that is very fast: in the
common case you only need to check with the server whether something has changed,
and you can save all the computation and bandwidth costs. At the same time,
you don't have to worry about cache invalidation or displaying out of date
results, because we confirm the accuracy of the results by checking with the
server.
Having both performance and safety is great; that is why we have made this the
default approach in RavenDB. But there is just one niggling issue still remaining.
We have to go to the database to check whether our information is still up to date.
We can save the computational and bandwidth costs, but the latency of going
to the database is usually the most expensive part. That is why we have the
next level of caching.

4.7.2 Aggressive Caching


I was giving a talk once, and I asked the audience: what do you think aggressive
caching means? One guy immediately said: "It is when the database beats back
with a stick anyone that wants to change the data." And while that would be a
nice feature, it isn't quite what we have in mind.
Aggressive caching is an opt-in feature, and using it takes your caching to the
next level. Let us see the code in Listing 4.3 first, then discuss it.
Listing 4.3: Using aggressive caching
using (documentStore.AggressivelyCache())
{
    for (int i = 0; i < 10; i++)
    {
        using (var session = documentStore.OpenSession())
        {
            var product = session.Load<Product>("products/1");
            Console.WriteLine(product.Name);
        }
        Console.ReadLine();
    }
}
Looking at the code in Listing 4.3, how many requests would you expect? Without
the AggressivelyCache call, we would expect there to be 10 requests. But with
it? With AggressivelyCache, we are only going to make a single request to the
server. Inside the aggressive cache scope, if we have a request in the cache, we
don't even check with the server whether it is up to date. That means that we
can process pretty much whatever we want in memory, without ever having to
make a single remote call.
You really can't get any faster than serving directly from your own local memory.
Of course, there is a downside: because we never check with the server,
someone might go and change the document on the server side,
and we'll miss that update. Now you can probably see why we are calling this
aggressive caching. Except that isn't how it really works.
Let us set up an experiment17. Put the code in Listing 4.3 in a console application,
and hit enter a time or two. Then go and change the product name, and hit
enter again. Because we are using aggressive caching, we aren't actually going
to go to the server and check that we have the latest version, so we expect to
see the same output. Instead, I got this:

Chai
Chai
Latte

We have aggressive caching enabled, and we didn't check with the server, but
we got the latest data anyway. How does that work?
17. I suggest looking at the RavenDB console, which shows a list of all the requests made to
the server. That can help you see when we are actually requesting data from the server and
when we are serving directly from the cache.


Instead of having the RavenDB Client API check all the time whether the cached
information is up to date (polling), we reverse the flow (pushing). The
client asks the database to let it know if there have been any changes. As long as
it hasn't gotten a change notification from the database, the client can safely serve
directly from the cache, with a high degree of confidence that the information
it gives you is up to date. And when it does get a change notification, it merely
needs to use the default caching route, where we go to the server to check whether our
results are actually up to date.

Cache complexity
It is easy to think that aggressive caching is going to use the details
in the change notifications to selectively invalidate parts of the cache.
However, that isn't the case18. The HTTP cache is pretty ignorant of
the way RavenDB works, and even if we tried adding the knowledge
to it, that would be very hard to handle.
Consider for example the case of changing a specific product document.
It is easy to see that the url for loading that document
should be invalidated. But what about a url for documents by name?
Should it be invalidated? What about the url for an order that has
an include for that product?
It is impossible to try to answer those questions without doing so
much work that the benefits of the cache would be nil. So we don't;
instead we take a far simpler route: a change notification invalidates
the entire cache.
Setting up aggressive caching also sets up the change notifications subscription,
and getting a change notification will cause the entire cache to be invalidated.
That sounds scary, and far too aggressive19. Any change? The whole cache?
That has got to pretty much kill this feature for real world purposes, right? But
we aren't clearing the cache; most of the data we already have in the cache is
still going to be up to date. The way it works, each cache entry has a time,
which marks when it was fetched from the database. And we track the time of
the last change notification from the server. When the last change notification
from the server is later than the time the cache entry was fetched, we have to
go to the database to make sure that the entry is still consistent.
Because of this behavior, aggressive caching is almost perfect. You can usually
serve the data directly from your own local cache, without having to make any
remote calls, and at the same time, you are notified of any changes and can check
with the server very quickly. Note that very quickly does not mean immediately.
While the actual latency between a document being changed and the client
being notified about the change is very short (on the order of milliseconds in
the common case), it isn't zero20.
18. I originally spelled this "that isn't the cache", but decided it was a bad pun.
19. I couldn't resist the pun this time.
This means that it is possible for you to load a document from the cache even
though it was changed 2 milliseconds ago. This violates one of the basic tenets
of RavenDB: that access to the documents is fully ACID and immediately consistent.
That is why you need to explicitly ask for this feature. I wholeheartedly
recommend taking advantage of it, but you need to consider which aspects of
your code can accept a potentially cached request, and which require full
consistency.
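There is also a variant that bounds the staleness window explicitly. The
AggressivelyCacheFor overload shown here is my recollection of the 3.0 client
and should be treated as an assumption:

// Assumed overload: bound how long a cached response may be served
// without revalidating against the server.
using (documentStore.AggressivelyCacheFor(TimeSpan.FromMinutes(5)))
using (var session = documentStore.OpenSession())
{
    // Served from the local cache when possible; entries older than
    // five minutes force a round trip to revalidate.
    var product = session.Load<Product>("products/1");
    Console.WriteLine(product.Name);
}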

4.8 Summary
Phew! This has been a long chapter. We covered a lot of the basic concepts
in RavenDB, from entities & aggregate roots to collections and metadata. We
then started to dive into deeper integration of your application with RavenDB.
On to identifiers: what not to do (Guids) and the various choices that you have
for identifiers. Most importantly, in my opinion, identifiers should be human
readable, because they are a key part of how you work with your system. The
RavenDB options we have (hilo, identity and semantic ids) all follow the same
principle. They are workable for humans first, and machines later.
After covering ids in great detail, we went on to etags, how they are composed and
what we do with them. We looked at the server side use for indexing and replication,
and at the client side use of etags with optimistic concurrency (including end-to-end
optimistic concurrency).
Finally, we looked at caching, and saw that on the client side, RavenDB has
three different caching options. The session cache, which is primarily used for
the Identity Map. The HTTP cache, which uses etags and the If-None-Match
header to check if the information has changed server side. Even better, we
have aggressive caching, which can skip going to the server entirely, and uses
change notifications to decide when to invalidate the cache.
In the next chapter, we'll continue to dive deeper into the client API. We'll talk
about advanced features such as streaming results, bulk inserts, subscribing to
database changes and using partial document updates. After that, we'll move
on to part two, indexing.

20. It is also possible for a network problem to cause us to miss a change notification. In
practice, that isn't an issue, because any change notification will force a check for everything,
but it is something to be aware of in terms of cache consistency.

Chapter 5

Advanced Client API Usage


With a good grounding in RavenDB concepts, we can turn our minds to a more
detailed exploration of the RavenDB Client API and what we can do with it.
We have already seen how we can use the client API for CRUD and querying
(although we'll deal with querying a lot more in Part II). Now we'll dive right in
and see how we can get the most out of RavenDB. In particular, in this chapter
we'll cover how we can get the most by doing the least.

5.1 Lazy is the new fast


The Fallacies of Distributed Computing were already covered in chapter 1, but
it is worth going over a couple of them anyway. This time, the relevant fallacy
is: latency is zero.
The reason the Fallacies are so well known is that we keep tripping over
them. Even experienced developers fall into the habit of thinking that their
environment (where everything is local) is the production environment, and end up
making a lot of remote calls. But latency isn't zero, and usually, just the cost of
going to the database is much higher than actually executing whatever database
operation we wanted.
The RavenDB Client API deals with this in several ways. First, whenever
possible, we batch calls together. When we call Store on a few entities, we don't
go to the server. Instead, we wait for the call to SaveChanges, which then saves
all the changes in one remote call. Second, we have a budget in place on how
many remote calls a single session can make. If your code exceeds this budget, an
exception will be thrown.
Because of this limit, we have a lot of ways in which we can reduce the number
of times we have to go to the server. One such example is the Includes feature,

which tells RavenDB that we'll want the associated documents as well as the
one we asked for. But what is probably the most interesting way to reduce the
number of remote calls is to be lazy. Let us first look at Listing 5.1, and then
we'll discuss what is going on there.
Listing 5.1: Using Lazy Operations
var lazyOrder = DocumentSession.Advanced.Lazily
    .Include<Order>(x => x.Company)
    .Load("orders/1");
var lazyProducts = DocumentSession.Query<Product>()
    .Where(x => x.Category == "categories/2")
    .Lazily();
DocumentSession.Advanced
    .Eagerly.ExecuteAllPendingLazyOperations();
var order = lazyOrder.Value;
var products = lazyProducts.Value;
var company = DocumentSession.Load<Company>(order.Company);
// show order, company & products to the user
In Listing 5.1, instead of writing DocumentSession.Include(...).Load(...), we
used the Advanced.Lazily option, and instead of ending the query with a ToList,
we used the Lazily extension method. That much is obvious, but what does
it mean? When we use lazy operations, we aren't actually executing the
operation. We merely register that we want it to happen. It is only when
Eagerly.ExecuteAllPendingLazyOperations is called that those operations
execute, and they all happen in one round trip.
Consider the case of a waiter in a restaurant. When the waiter is taking the order
from a group of people, it is possible for him to go to the kitchen and let the
cook know about a new Fish & Chips1 plate whenever a member of the group
makes an order. But it is far more efficient to wait until every member of
the group has ordered, and then go to the kitchen once.
Eagerly.ExecuteAllPendingLazyOperations vs. lazyOrder.Value
In Listing 5.1, we have used the Eagerly.ExecuteAllPendingLazyOperations
method to force all the lazy values to execute. But that is merely a
convenience method. We could have done the same by just calling
lazyOrder.Value.
At that point, the lazy object would initialize itself, realize that it
hasn't been executed yet, and call Eagerly.ExecuteAllPendingLazyOperations
itself. The end result is very much the same, but we still prefer
to call Eagerly.ExecuteAllPendingLazyOperations explicitly.
It is easy not to notice that you are using the lazyOrder.Value property
and trigger a query. I believe that it is much better to
have an explicit lazy evaluation boundary, rather than an implicit
one. That is especially true if you are coming back to the code several
months or years after you wrote it.
1. This part of the book was written while I'm hungry; there might be additional food
metaphors.
That is exactly what lazy does, except that it is actually even better. Lazy
batches all the requests until Eagerly.ExecuteAllPendingLazyOperations is called,
and then sends them to the server. On the server side, we unpack the request
and then execute all of the inner requests in parallel. The idea is that we can reduce
the overall latency by reducing the number of remote calls, and executing the
requests in parallel means that we are done processing everything so much faster.
Except for the deferred execution, a lazy operation behaves in exactly the
same way as its immediate counterpart. That holds true for options like Include
as well, although, of course, the included document is only loaded into the session
when the lazy operation has completed. Every read operation supported by RavenDB
is also available to be executed as a lazy request.
On the server side, we buffer all the responses for the lazy request until all the
inner requests are done, then we send them to the client. That means that if
you want to get back a very large amount of data, you probably don't want to
use lazy, or you should be prepared for increased memory usage on the server side.
A better approach for dealing with very large requests is streaming, but before
we get there, let us look at another way in which RavenDB stops bad
things from happening.

5.2 Unbounded Result Set Prevention


RavenDB was designed from the get go to be safe by default. One of the ways
this expresses itself is in its internal governors. We have seen
one such governor in the limit on the number of remote calls a session can
make. Another such governor is the limit on the number of results that will be
returned from the database. Take a look at the following snippet:

var orders = DocumentSession.Query<Order>()
    .ToList();


There are 880 orders in the Northwind database; how many results will this
query return? As you can imagine, the answer is not 880. Code like this snippet
is bad, because it makes assumptions about the size of the data. What would
happen if there were three million orders in the database? Would we really want
to load and materialize them all? This problem is called an Unbounded Result Set,
and it is very common in production, because it sneaks up on you. You start
out your system, and everything is fast and fine. Over time, more and more
data is added, and you end up with a system that slows down. In most cases
I've seen, just after reading all of the orders, the code discarded 99.5% of them and
just showed the user the last 15.
With RavenDB, such a thing is not possible. If you don't specify otherwise, all
queries are assumed to have a .Take(128) clause on them. So the answer to my previous
question is that the snippet above would result in 128 order documents being
returned. Some people are quite upset with this design decision; their argument
boils down to: "this code might kill my system, but I want it to behave like I
expect it to". I'm not sure why they expect to kill their system, but they can
do that without RavenDB's help2. That way, hopefully it wouldn't be us who
would get the 2 AM wakeup call and have to resolve what is going on.
Naturally, a developer's first response to hearing about the default .Take(128)
clause is this:

var orders = DocumentSession.Query<Order>()
    .Take(int.MaxValue) // "fix" RavenDB "bug"
    .ToList();
This is why we have an additional limit. You can specify a take clause up to
1024 in size; any value greater than 1024 will be interpreted as 1024. Now,
if you really want, you can change that by specifying the Raven/MaxPageSize
configuration option, but we very strongly recommend against that. RavenDB is
designed for OLTP scenarios, and there are really very few situations where you
want to read a lot of data to process a user's request.
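When you genuinely need more than one page, the idiomatic answer is explicit
paging rather than a giant Take. A minimal sketch (the page size and page
number here are arbitrary, chosen for the example):

int pageSize = 25;
int pageNumber = 3; // zero based
var page = DocumentSession.Query<Order>()
    .Skip(pageNumber * pageSize)
    .Take(pageSize)
    .ToList();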
For the situations where you actually do need all the data, the Query API isn't
really a suitable interface. That is why we have result streaming in
RavenDB.

5.3 Streaming results


Large result sets should be rare in your applications. RavenDB is designed for
OLTP applications, and there is very little cause for loading tens of thousands
of documents and trying to show them all to a user; they have no way of
processing so much information. It is much better to give them paging, so they
can consume the data at a reasonable rate, and reduce the overall load on the
entire system.
2. You can still shoot yourself in the foot with RavenDB, but we like to think it would take
some effort to do so.
But there is one relatively common case where you do need access to
the entire dataset: Excel.
More properly, any time you need to give the user access to a whole lot
of records to be processed offline. Usually you output things in a format that
Excel can understand, so the users can work with the data in a really nice tool.
Reports in general are a very common scenario for this requirement.
So how can we do that? One way would be to just page through everything,
something like the abomination in Listing 5.2:
Listing 5.2: The WRONG way to get all users
public List<User> GetAllUsers()
{
    // This code is an example of how NOT
    // to do things, do NOT try to use it
    // ever. I mean it!
    List<User> allUsers = new List<User>();
    int start = 0;
    while (true)
    {
        using (var session = DocumentStoreHolder.OpenSession())
        {
            var current = session.Query<User>()
                .Take(1024)
                .Skip(start)
                .ToList();
            if (current.Count == 0)
                break;
            start += current.Count;
            allUsers.AddRange(current);
        }
    }
    return allUsers;
}
I call the code an abomination because it has quite a few problems. To start
with, this code will work just fine on a small amount of data, but as the data size
grows, it will do more and more work, and consume more and more resources.
Let us assume that we have a moderate number of users, 100,000 or so. The
cost of the code in Listing 5.2 is: go to the server 98 times, do deeper paging
on each request, and hold the entire 100,000 users in memory.
Note that the code is also evil because it uses a different session per loop
iteration, preventing RavenDB from detecting that this code is
hammering the database with requests. Another problem is: when can we start
processing the results? The code buffers all the results, so we have to
wait until the entire process is done before we can start handling them. All of this means
longer duration, higher memory usage and a lot of waste.
And what happens if someone adds or deletes documents while this code is
running? We don't make a single request, so the reads happen in multiple transactions,
which raises consistency issues as well. In short, never write code like this. RavenDB has builtin
support for properly handling a large number of results, and it is intentionally
modelled to be efficient at that scale. Say hello to streaming. Listing 5.3 is the
same GetAllUsers method, now written properly.
Listing 5.3: The proper way to get all users
public IEnumerable<User> GetAllUsers()
{
    var allUsers = DocumentSession.Query<User>();
    IEnumerator<StreamResult<User>> stream =
        DocumentSession.Advanced.Stream(allUsers);
    while (stream.MoveNext())
    {
        yield return stream.Current.Document;
    }
}
Besides the fact that there is a lot less code here, let us take a look at what this
code does. Instead of sending multiple queries to the server, we are making a
single query. We also indicate that this is a streaming query, which means that
the RavenDB Client API will behave very differently, and use a different
endpoint for processing this query.

Paging and streaming
By default, a stream will fetch as many results as you need (up to
2.1 billion or so), but you can also apply all the normal paging rules
to a stream. Just add a .Take(10 * 1000) to the query before
you pass it to the Stream method.

On the client side, the differences are:
The use of an enumerator to immediately expose the results as they stream
in, instead of waiting for all of them to arrive before giving anything back
to your code.
The results of the Stream operation are not tracked by the session. There can
be a lot of results, and tracking them would put a lot of memory pressure
on your system. It is also very rare to call SaveChanges on a session that
is taking part in a streaming operation, so we don't lose anything.
The dedicated endpoint on the server side has different behavior as well. Obviously, it does not apply the Raven/MaxPageSize limit. But much more importantly, it will stream the results to you without doing any buffering, so
the client can start processing the results before the server has finished sending them. Another benefit is consistency: throughout the entire streaming
operation, we are going to be running under a single transaction.
What happens is that we operate on a database snapshot, so any additions,
deletions or modifications are just not visible. As far as the streaming operation
is concerned, the database is frozen at the time of the beginning of the streaming
operation.
RavenDB Excel Integration
I mentioned earlier that a very common use case for streaming is the
need to expose the database data to users in Excel format. Because
this is such a common scenario, RavenDB comes with dedicated
support for it. The output of the Stream operation is usually
JSON, but we can ask RavenDB to output it in CSV format that can
be readily consumed in Excel. You can go to the Excel Integration
documentation page to see the walkthrough.
The nice thing about that is that you can even refresh the data from
the database into Excel after the first import.
That said, please read the Reporting Chapter for a fuller discussion
on how to handle reporting with RavenDB. In general, just giving
users access to the raw data in your database results in a mess down
the road.
The Stream operation accepts a query, or a document id prefix, or even just the
latest etag that you have (which allows you to read all documents in update
order). This is the usual way you'll fetch large amounts of data
from RavenDB.
But what if I want to go the other way around? What if I want to save a lot of
data into RavenDB? That is why we have the bulk insert operation.
5.4 Bulk inserts
Inserting data into RavenDB is pretty easy, we just have to call Store and then
SaveChanges. But what happens when we want to insert a lot of data? Listing
5.4 shows one such example:
Listing 5.4: Inefficiently insert new users, first option
using (var session = documentStore.OpenSession())
{
    foreach (var userLineCsv in File.ReadLines("users.csv"))
    {
        User user = User.FromCsv(userLineCsv);
        session.Store(user);
    }
    session.SaveChanges();
}
Listing 5.4 is good because we'll only have to go to the database once, right?
We call SaveChanges, and things are saved in an optimal fashion. The answer
to that is a definite maybe. The problem with giving a good answer is that we
are missing a very important piece of information: how big is the users.csv file?
If it is in the range of hundreds to low thousands of users, this is a great way
to handle the load. If it is more than that, we have a problem. The session isn't
really meant for bulk operations. Let us assume that the users.csv file contains
50,000 users. That would mean that we would load 50,000 users into memory,
then create a single request with a payload of 50,000 documents in it,
then send it to the server as a single transaction.
The likely result is that we'll get an out of memory exception at some point
along the way. In particular, there is a limit to how much data can be changed
in a single transaction. Admittedly, this limit is in the hundreds of megabytes
(it is controlled via the Raven/Esent/MaxVerPages option in Esent, and by the
Raven/Voron/MaxScratchBufferSize option in Voron), but it is there. A very
long write transaction is also something that we would like to avoid, because
it is pretty costly.
A common solution is to use multiple sessions: we'll batch up to 512 users,
then we'll call SaveChanges, then we create a new session. The downside of this
approach is that we still have relatively large requests, and now we have a lot of
them.
Trying this option with 50,000 users, we get the code in Listing 5.5. This is
quick & dirty code, written merely to show a point.
Listing 5.5: Inefficiently insert new users, second option
int amount = 0;
var session = documentStore.OpenSession();
foreach (var i in Enumerable.Range(0, 50 * 1000))
{
    User user = new User
    {
        Name = "Hello " + i
    };
    session.Store(user);
    if (amount++ > 512)
    {
        amount = 0;
        session.SaveChanges();
        session.Dispose();
        session = documentStore.OpenSession();
    }
}
session.SaveChanges();
session.Dispose();
This code runs in 21.65 seconds, a rate of 2,310 documents per second. It is not
great, but it isn't bad either. Trying with a batch size of 1024 resulted in a run
time of just over 25 seconds, and a batch size of 256 took 26 seconds, so for this
workload, 512 seems to be working fine.
Such tasks aren't frequent in RavenDB. It isn't often that you need to put
so much data into the database. However, we have to deal with yet another
common case: the nightly ETL process. Every night we get a file from some
other system that we need to load into our system. That means that we want
to be able to give you a good solution for this issue.
Just telling you to do batched SaveChanges isn't enough. Hence, the need for
bulk insert. Bulk insert operates in the exact opposite manner from streaming.
Let us see the code in Listing 5.6 and then we'll discuss what is going on.
Listing 5.6: Efficiently insert new users using bulk insert
using (var bulkInsert = documentStore.BulkInsert())
{
    foreach (var i in Enumerable.Range(0, 50 * 1000))
    {
        User user = new User
        {
            Name = "Hello " + i
        };
        bulkInsert.Store(user);
    }
}
As you can see, we don't do batching in Listing 5.6. We create a single BulkInsert
and use that. What actually happens is that the BulkInsert is using a single
long request to talk to the server. Whenever bulkInsert.Store is called, we
are sending that data to the server immediately. On the server side, it is also
processed immediately, instead of having to wait for everything to get to the
server.
Bulk Insert Batches
Internally, we don't actually send each document over the network
independently. We have a batch size & time limit. We batch all the
documents that we get up to the batch size (whose default is 512)
or until we have 200 ms without a new document being stored.
At that point, we compress all the documents we have batched, and
send all of them together to the server. On the server side, we read
this batch, and we process all the documents in the batch as a single
transaction. As a result of this, you don't have very big transactions,
just a lot of small internal transactions.
It is important to understand that if the bulk insert fails midway,
all the previous batches have already been committed, but all the
documents in the current batch will be rolled back. In practice, this
means that if the bulk insert failed, you don't know what data was
committed and what wasn't.
More than anything else, the fact that we can parallelize client and server work
means that we get really good performance. It doesn't hurt that the code
path that BulkInsert uses is also highly optimized for inserts, as you can
imagine.
BulkInsert can also accept an options argument:
documentStore.BulkInsert(options: new BulkInsertOptions());
Those options include:
BatchSize - How many documents we should wait for before sending a
batch to the server. The default is 512.
OverwriteExisting - If set to true, allows RavenDB to overwrite an existing
document; otherwise, an error is thrown if the inserted document already
exists. Default is false; setting this to true will reduce the insert speed.
CheckReferencesInIndexes - Whether document references (resulting
from LoadDocument) need to be checked. We'll discuss this feature in
detail in Part II. Default is false; setting this to true will reduce the
insert speed.
WriteTimeoutMilliseconds - How long we can wait for the full queue
to clear. Default is 15,000 ms (15 seconds).
The last two options deserve some additional explanation. One issue with bulk insert
is that the client side can usually generate the documents far faster than the
server can receive and store them. If the number of documents is high, it is
possible that the number of documents waiting to be sent to the server will be
very high. Eventually, you'll run out of memory.
Because of that, the BatchSize option also controls how many documents we
can have waiting to be sent. The queued documents can be up to 150% of the
BatchSize. Assuming the BatchSize is set to 512, the maximum
number of pending documents will be 768. At this point, if you try to Store an
additional document, you'll be blocked until another batch has been processed
and space in the queue is freed.
To prevent a situation where you're blocked for a very long time, we use the
WriteTimeoutMilliseconds value to make sure that a timeout exception is thrown
if we are waiting for too long.
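Putting those options together, a sketch (the specific values here are illustrative, not recommendations):
using (var bulkInsert = documentStore.BulkInsert(options: new BulkInsertOptions
{
    BatchSize = 1024,                    // larger batches, fewer round trips
    OverwriteExisting = true,            // replace existing documents instead of erroring
    WriteTimeoutMilliseconds = 30 * 1000 // wait up to 30 seconds for the queue to clear
}))
{
    // store documents exactly as in Listing 5.6
}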
At the end of the day, inserting 50,000 documents with BulkInsert took 11.33
seconds, a rate of about 4,415 documents per second, or roughly twice as fast as
the alternative.
Benchmark & lies
A note about the performance numbers. I'm not trying to do a
benchmark here for absolutely the best performance. I'm actually
running this with a debug build of RavenDB on both the client
& server, while the system is also running integration tests in the
background, on a laptop, while riding the train. Don't trust those
numbers!
We can get to 15,000 - 20,000 sustained writes per second on standard server hardware, using a release build, and we can go beyond
that by configuring the database settings properly (we'll discuss
those options in Part IV - Operations).
What I'm trying to do is to give you a sense of the relative performance differences between the session and bulk insert. And in
general, bulk insert is going to be between two and ten times as fast
as using SaveChanges and batching.
The last, but certainly not least, important thing about BulkInsert: if you look
at Listing 5.6, you can see that it is wrapped in a using statement. This is
important. The BulkInsert uses the Dispose call to flush the remaining data
to the server, close the connection and in general clean up after itself. It is only
after the Dispose has completed that you can be certain that all the data that
was bulk inserted is actually safe inside the database.
So far we talked about the big stuff: how we can read a lot of data and write a
lot of data. Now I want to turn in the complete opposite direction and go into
the very small. How can I update just a part of a document?
5.5 Partial document updates
For the most part, when working with RavenDB you'll be working with full
documents. That means that you'll load a document, modify it, and save it
back to the database as a single unit. That matches closely with the idea that
documents are a unit of change.
But there are good reasons why you'll want to do just a partial document
update (though they are also pretty rare situations; for the most part, patching
should be the exception). There are two common reasons to want to do that:
You have a piece of data that has a good, legitimate reason to be changed
concurrently.
You want to save the cost of loading the full document and saving the full
document.
Both reasons are somewhat problematic. Because a document is a unit of change,
there should be only a single reason to change it, and all updates to a document
are serialized. Reasons for wanting to change a document concurrently should
be rare. One such example might be adding a comment to a blog post. Because
there are no associations between comments, it is valid to have two comments
being added to the same blog post at the same time.
Wanting to save the cost of loading and saving the full document is a warning
sign. That usually points to a problem in the way you are structuring your
documents. We'll discuss this further in the Modeling Chapter. For now, it is
important to note that regardless of how you wish to modify a document (full
update or patching), on the server side, the effect is always replacing the whole
object.
With the cautionary words out of the way, the main advantages of patching
are that we can handle concurrency in a more granular fashion and that we
are generally sending (a lot) less data over the wire. Usually, if we have two
concurrent modifications to the same document, that would generate a ConcurrencyException on one side. That is because we don't know which version
should win.
With patching, we don't have the new version of the document; we have a
description of the change we want to make to the document. And now let us
see how we can actually execute partial updates.
5.5.1 Simple Patch API
RavenDB has two possible patching options. Those are called the Simple Patch
and Scripted Patch. The simple patch API is fairly limited in what it can do.
It has operations to:
Set a property
Unset (remove) a property
Add an item to an array
Insert an item into an array at a specified location
Remove an item from an array at a specified location
Increment a property
Rename a property
Copy a property's value to another property
Modify a nested value by using any of the supported simple patch operations
Listing 5.7 shows an example of using the simple patch API to reduce the level
of a product in stock.
Listing 5.7: Using simple patching to decrement products in stock value
using (var session = documentStore.OpenSession())
{
    session.Advanced.Defer(new PatchCommandData
    {
        Key = "products/1",
        Patches = new[]
        {
            new PatchRequest
            {
                Name = "UnitsInStock",
                Type = PatchCommandType.Inc,
                Value = new RavenJValue(-1)
            },
        }
    });
    session.SaveChanges();
}
Additional options that you can set include what to do if
the document doesn't exist (we'll run a different set of patch commands, located
in the PatchIfMissing property). Or we can specify that we'll only change a
property value if its current value matches the PrevVal property, etc.
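A sketch of both options together (the property names here, PatchesIfMissing and PrevVal, are my reading of the 3.0 client API, and the values are illustrative):
session.Advanced.Defer(new PatchCommandData
{
    Key = "products/1",
    Patches = new[]
    {
        new PatchRequest
        {
            Name = "UnitsInStock",
            Type = PatchCommandType.Inc,
            Value = new RavenJValue(-1),
            // only apply if the current value is exactly 10
            PrevVal = new RavenJValue(10)
        },
    },
    // executed instead when products/1 does not exist
    PatchesIfMissing = new[]
    {
        new PatchRequest
        {
            Name = "UnitsInStock",
            Type = PatchCommandType.Set,
            Value = new RavenJValue(0)
        },
    }
});
session.SaveChanges();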
session.Advanced.Defer
The Advanced.Defer method allows you to register a low level command (such as PatchCommandData) to be carried out when the
session's SaveChanges is called.
This allows you to add commands to the same transaction that will
occur when SaveChanges happens, along with all the other changes
in the session.
To be perfectly frank, simple patching is hard, and it isn't really nice to use or
very flexible. We've added a lot of options to it over the years, but fundamentally it remained not very friendly. In general, we recommend that you avoid
using it in favor of Scripted Patching.
5.5.2 Scripted patching
RavenDB is a database for working with JSON documents. What can be more
natural for working with those documents than JavaScript? And that is exactly
what Scripted Patching provides. It allows you to write a JavaScript script and
run it against a document. Listing 5.8 replicates the same behavior we have
seen with simple patching in Listing 5.7.
Listing 5.8: Using scripted patching to decrement products in stock value
using (var session = documentStore.OpenSession())
{
    session.Advanced.Defer(new ScriptedPatchCommandData
    {
        Key = "products/1",
        Patch = new ScriptedPatchRequest
        {
            Script = "this.UnitsInStock--;"
        }
    });
    session.SaveChanges();
}
There isn't much to it, is there? But at the same time, this gives you tremendous power. A more interesting script would be to split the first and last name into
separate properties. We can do that using the following script (the surrounding
code has been omitted for clarity's sake).
var parts = this.Name.split(' ');
this.FirstName = parts[0];
this.LastName = parts[1];
delete this.Name;
You can also use if, while and the full power of JavaScript. In addition to the
basic JavaScript library of methods, you also have the lodash.js library available.
Here is how we can create some random data using the lodash.js functions:
this.Age = _.random(17, 62);
You can read more about the lodash.js methods in the online documentation.
For now, suffice to say that you have the ability to transform your documents
and apply logic and behavior on a very granular basis. However, you should be
aware that while you can do a lot of stuff using patching, it is still advisable to
reserve it for cases where you have no other choice.
It is usually better to work directly with the documents, and not to mess around
with patching. Trying to do too much with patching means that you'll deal with
a lot of scripts, and lose a lot of the nicer abilities that RavenDB gives you (client
side type safety, single reason to change, etc.).
Limitations
Because scripted patches are run on the server side, we need to be
cautious about their use. RavenDB will flag and kill any script that
is obviously abusive (trying to create a stack overflow, infinite loop,
etc.). By default, a script is limited to about 10,000 operations,
after which it is killed.
However, especially with large documents or complex scripts, it can
take a while for RavenDB to execute a script. RavenDB does
quite a lot to optimize script execution, including caching the parsed
scripts, but it still requires us to evaluate the scripts, and that
has a non-trivial cost.
5.5.2.1 Parameters
Frequently, you need to customize your script to allow for different options. For
example, if we look at Listing 5.8, we can see that we decrement the units in
stock by one. But what would happen if we wanted to decrement the units
in stock by 7, or 5?
One wrong way of doing that would be the following:
Script = "this.UnitsInStock -= " + amountToDecrement + ";";
Allow me to count the ways this is wrong. Just like building SQL
strings using concatenation, this is wrong. It produces hard to read code, makes
it much harder to cache the scripts and introduces the possibility of user
input injection.
Instead, just like with SQL again, we have a much better option: using parameters.
See Listing 5.9.
Listing 5.9: Using scripted patching with parameters
using (var session = documentStore.OpenSession())
{
    session.Advanced.Defer(new ScriptedPatchCommandData
    {
        Key = "products/1",
        Patch = new ScriptedPatchRequest
        {
            Script = "this.UnitsInStock -= amountToDecrement;",
            Values = { { "amountToDecrement", 7 } }
        }
    });
    session.SaveChanges();
}
You can see that we pass a variable amountToDecrement to the script. The
advantage of a variable is that you don't have to worry about input injection,
you don't have to build the script by string concatenation, and the parsed
script can be fully cached and reused many times.
5.5.2.2 Accessing other documents
One of the really nice things about the scripted patching API is that it not only
gives you full access to the current document (exposed as the this object), it
also gives you full access to other documents. You can call LoadDocument to
load another document and use its values to modify your document, as you can
see in Listing 5.10. Or you can call DeleteDocument to remove a document, or
even call PutDocument to create or update another document.
Listing 5.10: Loading a related document during patching
using (var session = documentStore.OpenSession())
{
    session.Advanced.Defer(new ScriptedPatchCommandData
    {
        Key = "products/1",
        Patch = new ScriptedPatchRequest
        {
            Script = @"
var category = LoadDocument(this.Category);
this.CategoryName = category.Name;
",
        }
    });
    session.SaveChanges();
}
You can read the full details about the DeleteDocument and PutDocument
methods in the online documentation, since they are far less used than
LoadDocument.
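Still, to give a sense of the shape of PutDocument, here is a sketch (the audit document id and its contents are made up for illustration; the exact script function signatures are in the online documentation):
session.Advanced.Defer(new ScriptedPatchCommandData
{
    Key = "products/1",
    Patch = new ScriptedPatchRequest
    {
        // PutDocument(id, document, metadata) creates or updates a document
        Script = @"
PutDocument('audits/products/1',
    { PatchedProperty: 'UnitsInStock' },
    { });
"
    }
});
session.SaveChanges();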
5.5.3 Concurrency
What will happen if you have two patch requests for the same document at the
same time? Because they aren't full document updates, it is actually possible to
make this work. The RavenDB engine will serialize all those patch requests to
the document, then execute them one at a time. Since each one represents a
change to the document, rather than its full content, this is safe to do. That
is why one of the main use cases for patching is to concurrently update
documents. That said, note that if you have a lot of
patch requests to the same document, eventually the internal queue RavenDB
uses will fill up and patch requests for this document will be rejected. If there
are other operations in the same transaction, they will also fail as a single unit.
And with that, let us zoom out from partial document updates to a far
bigger scope. How can we apply application wide behaviors in RavenDB using
listeners?
5.6 Listeners
It is pretty common to want to run some code whenever something happens in
RavenDB. The classic example is when you want to store some audit information
about who modified a document. In the previous section, we saw that we can
do that manually, but that is both tedious and prone to errors or omissions. It
would be much better if we could do it in a single place.
That is why the RavenDB Client API has the notion of listeners. Listeners allow
you to define, in a single place, additional behavior that RavenDB will execute
at particular points in time. RavenDB has the following listeners:

IDocumentStoreListener - called when an entity is stored on the server.


IDocumentDeleteListener - called when a document is being deleted.
IDocumentQueryListener - called before a query is made to the server.
IDocumentConversionListener - called when converting an entity to a document and vice versa.
IDocumentConictListener - called when a replication conicted is encountered, this listener is discussed in depth in Chapter 10, Replication.
The store and delete listeners are pretty obvious. They are called whenever a
document is stored (which can be a new document or an update to an existing
one) or when a document is deleted. A common use case for the store listener
is as an audit listener, which can record which user last touched a document.
A delete listener can be used to prevent deletion of a document based on your
business logic, and a query listener can modify any query issued.
You can see examples of all three in Listing 5.11.
Listing 5.11: Store, Delete and Query listeners
public class AuditStoreListener : IDocumentStoreListener
{
    public bool BeforeStore(string key,
        object entityInstance, RavenJObject metadata,
        RavenJObject original)
    {
        metadata["Last-Modified-By"] = WindowsIdentity
            .GetCurrent().Name;
        return false;
    }

    public void AfterStore(string key,
        object entityInstance, RavenJObject metadata)
    {
    }
}

public class PreventActiveUserDeleteListener :
    IDocumentDeleteListener
{
    public void BeforeDelete(string key,
        object entityInstance, RavenJObject metadata)
    {
        var user = entityInstance as User;
        if (user == null)
            return;
        if (user.IsActive)
            throw new InvalidOperationException(
                "Cannot delete active user: " +
                user.Name);
    }
}

public class OnlyActiveUsersQueryListener :
    IDocumentQueryListener
{
    public void BeforeQueryExecuted(
        IDocumentQueryCustomization queryCustomization)
    {
        var userQuery = queryCustomization as
            IDocumentQuery<User>;
        if (userQuery == null)
            return;
        userQuery.AndAlso().WhereEquals("IsActive", true);
    }
}
In the AuditStoreListener, we modify the metadata to include the current user
name. Note that we return false from the BeforeStore method as an indication
that we didn't change the entityInstance parameter. This is an optimization
step, so we won't be forced to re-serialize the entityInstance if it wasn't changed
by the listener.
In the PreventActiveUserDeleteListener case, we throw if an active user is being deleted. This is very straightforward and easy to follow. It is the case
of OnlyActiveUsersQueryListener that is interesting. Here we check if we are
querying on users (by checking if the query to customize is an instance of
IDocumentQuery<User>) and if it is, we also add a filter on active users only.
In this manner, we can ensure that all user queries will operate only on active
users.
We register the listeners on the document store during initialization.
Listing 5.12 shows the updated CreateDocumentStore method on the
DocumentStoreHolder class.
Listing 5.12: Registering listeners in the document store
private static IDocumentStore CreateDocumentStore()
{
    var documentStore = new DocumentStore
    {
        Url = "http://localhost:8080",
        DefaultDatabase = "Northwind",
    };
    documentStore.RegisterListener(
        new AuditStoreListener());
    documentStore.RegisterListener(
        new PreventActiveUserDeleteListener());
    documentStore.RegisterListener(
        new OnlyActiveUsersQueryListener());
    documentStore.Initialize();
    return documentStore;
}
Once registered, the listeners are active and will be called whenever their respective actions occur.
The IDocumentConversionListener gives you fine grained control over the
process of converting entities to documents and vice versa. If you
need to pull data from an additional system when a document is loaded, this is
usually the place where you'll put it (that said, pulling data from secondary
sources on document load is frowned upon; documents are coherent and
independent, you shouldn't require additional data, and doing so is usually a
performance problem).
A far more common scenario for the conversion listener is to handle versioning,
whereby you modify the old version of the document to match an updated entity
definition on the fly. This is a way for you to do rolling migrations, without an
expensive stop-the-world step along the way.
While the document conversion listener is a great aid in controlling the conversion process, if all you care about is the actual serialization, without the need
to run your own logic, it is probably best to go directly to the serializer and use
that.
5.7 The Serialization Process
RavenDB uses the Newtonsoft.JSON library for serialization. This is a very
rich library with quite a lot of options and levers that you can tweak. Because of version incompatibilities between RavenDB and other libraries that
also have a dependency on Newtonsoft.JSON, RavenDB has internalized the Newtonsoft.JSON library. To access the RavenDB copy of Newtonsoft.JSON, you
need to use the following namespace: Raven.Imports.Newtonsoft.Json.
Newtonsoft.JSON has several options for customizing the serialization process.
One of those is a set of attributes (JsonObjectAttribute, JsonPropertyAttribute,
etc.). Because RavenDB has its own copy, it is possible to have two sets of such
attributes: one for serialization of the entity to a document in RavenDB, and
another for serialization of the document for external consumption, as sketched below.
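A sketch of what that can look like (the UserProfile class is hypothetical; JsonIgnore and JsonProperty exist in both copies of the library):
public class UserProfile
{
    // ignored by RavenDB's internal serializer, so not stored in the document
    [Raven.Imports.Newtonsoft.Json.JsonIgnore]
    // still serialized (as "avatar") when using the regular Newtonsoft.JSON
    [Newtonsoft.Json.JsonProperty("avatar")]
    public string AvatarUrl { get; set; }
}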
Another method of customizing the serialization in Newtonsoft.JSON is the
documentStore.Conventions.CustomizeJsonSerializer event. Whenever a serializer is created by RavenDB, this event is called and allows you to define the
serializer's settings. You can see an example of that in Listing 5.13.
Listing 5.13: Customizing the serialization of money
DocumentStoreHolder.Store.Conventions
    .CustomizeJsonSerializer += serializer =>
    {
        serializer.Converters.Add(new JsonMoneyConverter());
    };

public class JsonMoneyConverter : JsonConverter
{
    public override void WriteJson(JsonWriter writer,
        object value, JsonSerializer serializer)
    {
        var money = (Money) value;
        writer.WriteValue(money.Amount + " " + money.Currency);
    }

    public override object ReadJson(JsonReader reader,
        Type objectType, object existingValue,
        JsonSerializer serializer)
    {
        var parts = reader.ReadAsString().Split();
        return new Money
        {
            Amount = decimal.Parse(parts[0]),
            Currency = parts[1]
        };
    }

    public override bool CanConvert(Type objectType)
    {
        return objectType == typeof (Money);
    }
}

public class Money
{
    public string Currency { get; set; }
    public decimal Amount { get; set; }
}
The idea in Listing 5.13 is to have a Money object that holds both the amount
and the currency, but to serialize it to JSON as a string property. So a Money
object representing 10 US Dollars would be serialized to the string "10 USD".
The JsonMoneyConverter converts to and from the string representation, and
the JSON serializer customization event registers the converter with the serializer.
Note that this is probably not a good idea. You will usually want to store the Money
without modifications, so you can do things like summing up orders by currency, or
otherwise actually work with the data.
I would only consider using this approach as an intermediary step, probably as
part of a migration, if I had two versions of the application working concurrently
on the same database.
5.8 Changes() API
Assume that we have a user busy working on a document, and that the process
can take a while. In the meanwhile, another user comes in and changes that
document. What would the experience be from the point of view of the first user? Well,
either we let the Last Write Win, or we use Optimistic Concurrency. Either
way, we're going to have to annoy someone. How about being able to notify
the user, as soon as the document has been updated, that it needs to be refreshed?
But that would require us to check with the server periodically whether the
document has changed, and that isn't a nice solution to have.
Polling is wasteful; most of the time you spend a lot of time asking the same
question and expecting to get the same answer. Anyone who has had to deal with
"are we there yet?" and "are we there yet, now?" knows how annoying that can
be. Setting that aside, we have to consider load and latency factors as well. In
short, we don't want to do polling. So it is good that we don't have to. Listing
5.14 shows us how; note that this code uses the Reactive Extensions package
(Install-Package Rx-Core using NuGet).
Listing 5.14: Registering for Changes() notifications
var stopSubscription = documentStore.Changes()
    .ForDocument("products/2")
    .Subscribe(notification =>
    {
        string msg = notification.Type + " on document " + notification.Id;
        Console.WriteLine(msg);
    });
// change products/2
Console.ReadLine();
// stop getting notifications for products/2
stopSubscription.Dispose();
Using the code in Listing 5.14, we registered for all changes (Put, Delete) to
the products/2 document. We can then take actions, such as notifying the user (using
SignalR if we are running in a web application, for example). You can register for
notifications on specific documents, on all documents with a specific prefix or of a
specific collection, on all document changes, or on updates to indexes, as the
sketch below shows.
Notice that the change notification includes the document (or index) id and the
type of the operation performed: Put or Delete in the case of documents, most
often. If you want to actually access the document in question, you'll need to
load it using a session.
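A sketch of those other subscription types (the handler bodies here are trivial placeholders):
documentStore.Changes()
    .ForDocumentsStartingWith("products/")  // any document with this id prefix
    .Subscribe(n => Console.WriteLine(n.Id));

documentStore.Changes()
    .ForDocumentsInCollection("Users")      // any document in the Users collection
    .Subscribe(n => Console.WriteLine(n.Id));

documentStore.Changes()
    .ForAllDocuments()                      // every document change
    .Subscribe(n => Console.WriteLine(n.Type));

documentStore.Changes()
    .ForIndex("Products/ByName")            // updates to a specific index
    .Subscribe(n => Console.WriteLine(n.Name));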
Another important issue when dealing with notifications: once subscribed, you'll
continue to get notifications until you have disposed of the
subscription (the last line in Listing 5.14). And obviously, your subscription is
going to be called on another thread, so you need to be aware that you might
need to marshal your action to the appropriate location.
And this is pretty much it for Changes(). It is a powerful feature, and it enables
a whole host of interesting scenarios. But from an external point of view, it is also
drop dead simple to work with and use. And with that in mind, let us look at
another such feature: result transformers and what you can do with them.
5.9 Result transformers
Result transformers are server side transformations that allow us to project
specific data to the client. That is a very impressive statement, but what does
it mean?
Let us take a simple example: we want to show a few details about an order,
such as the order's date, the company name and its id.
We can do that using the following code:
var order = session.Include<Order>(x => x.Company)
    .Load(orderId);
var company = session.Load<Company>(order.Company);
Console.WriteLine("{0}\t{1}\t{2}", order.Id, company.Name, order.OrderedAt);
This is pretty simple, and because we used an Include, we only have to go to
the server once. But note that we have to load the entire Order and Company
documents to run this code, even though we only want a few properties. We
had to send over a kilobyte of data over the wire, while we actually show the
user a lot less data than that.
Creating transformers on the server
Because result transformers are server side artifacts, you need to
create them on the server before they can be used. You can do that
manually at application startup using the following code (to create
the transformer shown in Listing 5.15):
new JustOrderIdAndcompany().Execute(documentStore);
Or (as you'll see in Chapter 6 - Indexing) using
IndexCreation.CreateIndexes to create all indexes and transformers, which is the more commonly used method, as sketched below. It is safe to call
either method when the transformer already exists; in those cases,
if the transformer has changed it will be updated, and if the transformer
hasn't changed, this will be a no-op.
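For instance, a sketch of the assembly scanning variant (assuming the transformer classes live in the same assembly as JustOrderIdAndcompany):
IndexCreation.CreateIndexes(
    typeof (JustOrderIdAndcompany).Assembly, documentStore);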
The Northwind dataset is pretty small, and this is somewhat of an extreme
example, but the principle holds. We don't have infinite bandwidth, after all
(remember the fallacies of distributed computing!). So we don't want to pay
this price. How can we get just the information that we want? Listing 5.15
shows a simple result transformer.
Listing 5.15: Simple result transformer and its usage
public class JustOrderIdAndcompany : AbstractTransformerCreationTask<Order>
{
    public class Result
    {
        public string Id { get; set; }
        public string Company { get; set; }
        public DateTime OrderedAt { get; set; }
    }

    public JustOrderIdAndcompany()
    {
        TransformResults = orders =>
            from order in orders
            select new { order.Id, order.Company, order.OrderedAt };
    }
}

// usage
var order = session.Load<JustOrderIdAndcompany, JustOrderIdAndcompany.Result>(orderId);
var company = session.Load<Company>(order.Company);
Console.WriteLine("{0}\t{1}\t{2}", order.Id, company.Name, order.OrderedAt);
Using the JustOrderIdAndcompany transformer, we were able to load just those
specific properties from the document, saving the cost of loading the entire
document. However, we are now no longer loading the associated company
document, so we have to make two calls to the server to get all the information we
need. Having to go to the server twice is a bummer, but there is a reason we
don't include the company document: we have a far better option available to
us, LoadDocument, which we'll discuss in the next section.
5.9.1 Load Document
Result transformers are pretty cool; being able to select which properties to
project can save a lot in bandwidth. But what makes them a truly awesome
feature is the ability to reference other documents. Listing 5.16 shows a much
better version of the transformer.
Listing 5.16: Projecting data from multiple documents in a result transformer
public class JustOrderIdAndcompanyName : AbstractTransformerCreationTask<Order>
{
    public class Result
    {
        public string Id { get; set; }
        public string CompanyName { get; set; }
        public DateTime OrderedAt { get; set; }
    }

    public JustOrderIdAndcompanyName()
    {
        TransformResults = orders =>
            from order in orders
            let company = LoadDocument<Company>(order.Company)
            select new { order.Id, CompanyName = company.Name, order.OrderedAt };
    }
}

// usage
var order = session.Load<JustOrderIdAndcompanyName, JustOrderIdAndcompanyName.Result>(orderId);
Console.WriteLine("{0}\t{1}\t{2}", order.Id, order.CompanyName, order.OrderedAt);
Notice what we are doing in the JustOrderIdAndcompanyName transformer.
We are using the LoadDocument method inside the transformer to load the
associated company document, then we project just the company name out to
the client.
This feature gives us complete control over how to shape the data from the
server, and being able to pull data from associated documents is very powerful.
Just to complete this discussion, the over the wire cost is less than 300 bytes.
And in real world situations, the actual saving is far more impressive.
You aren't limited to just one document to be loaded. We could have gotten
the employee's name for this order in much the same way we got the company's
name.
This is actually just a taste of what you can do with transformers. We'll run
into them again when we discuss indexing, which is where result transformers
really shine.
5.10 Summary
We covered a lot of ground in this chapter. We started by talking about lazy
loading and how we can use it to reduce the number of remote calls we are
making, then diverged into talking about Safe by Default and how we have a
budget that limits the number of remote calls we can make. Hence the numerous
ways that the RavenDB Client API gives us to reduce the number of remote
calls you make (and along the way, improve your application's performance).
Next, and along the same lines, we covered another Safe by Default topic: preventing Unbounded Result Sets. RavenDB does that by specifying a limit (of
128 results) on the number of results a query will return, if you don't specify
such a limit yourself, and by enforcing a maximum upper limit (of 1,024 results,
by default) for the total number of results that can be requested.
But whenever there is a limit, there is also a way to avoid it. And no, that isn't
by using .Take(int.MaxValue). You can get the entire result set, regardless of
size, when you use the Streaming API. This API is meant to deal with a large
number of results, and it will stream results on both client and server so you
can parallelize the work.
The other side of getting a very large result set from the database is actually
inserting a lot of data into the database. We looked at doing that using
SaveChanges, and then using the dedicated API for that, BulkInsert. From the
very big, we moved to the very small, looking at how we can do partial document
updates using patching. We looked at simple patching and scripted patching,
what we can do with them and how they work.
And from updating just part of a document, we moved to handling cross cutting
concerns via listeners. More specifically, how you can use listeners to provide
application wide behavior from a single location. For example, handling auditing
metadata in a listener means that you never have to worry about forgetting
to write the audit code. We also learned about the serialization process and
how you can have fine grained control over everything that goes on during
serialization and deserialization.
Closing this chapter, we looked at the Changes() API, which allows us to register
for notifications from RavenDB whenever a document changes (as well as a
host of other stuff), and at result transformers, which allow us to run a server side
transformation of the data before it is sent to the client.
Wow, that was a lot of stuff to go through. This concludes Part I, which gave
you the basic tools of how to use RavenDB. Next, we are going to start talking
about indexing, and all the exciting things that you can do with it. Go get
some coffee: you're going to want to be awake for what is coming.
Part II
In this part, we'll learn about indexing and querying:
Deep dive into RavenDB Indexing implementation
Ad-hoc queries, automatic indexes and the query optimizer
Why do we need indexes?
Dynamic & static indexes
Simple (map only) indexes
Full text searching, highlights and suggestions
Multi Map indexes
Map/Reduce Indexes
Spatial queries
Facets and Dynamic aggregation
Advanced querying options
Chapter 6

Inside the RavenDB indexing implementation
In this chapter, we are going to go over a lot of the theoretical details and the
reasoning behind how RavenDB indexes work. You'll not actually learn how to
use the indexes in this chapter, but you'll learn all about their details. You can
feel free to skip this chapter for now, and go straight ahead to the next one to
read the practical details of using indexes. But come back here and read this
chapter at your leisure; it contains a lot of very important information about
how RavenDB operates internally.
We have already done quite a bit with RavenDB, but we haven't talked about
indexing at all, and very little about querying. That doesn't mean that we didn't
use indexes, however. Let us consider the following query:
var recentOrdersQuery =
    from order in session.Query<Order>()
    where order.Company == "companies/1"
    orderby order.OrderedAt descending
    select order;
var recentOrders = recentOrdersQuery.Take(3).ToList();
How does a query like that work on the server side? If you are used to relational
databases, you might assume that the following pseudo code is run:
var results = new List<Order>();
foreach (var order in GetAllDocumentsFor("Orders")) {
    if (order.Company == "companies/1")
        results.Add(order);
}
results.Sort((x, y) => y.OrderedAt.CompareTo(x.OrderedAt));
This type of operation is called a table scan, and it is quite frequent in relational
databases. It is also quite efficient, as long as the number of items you have in
the database is very small. It tends to fail pretty horribly the moment your data
size reaches any significant size. I've run into variants of this issue at customers
over and over again, and when the time came to design RavenDB, I decided
that as part of the Safe By Default culture, we would simply not support this
problem.
RavenDB does no table scans! In fact, there are no O(N) operations in general
in RavenDB queries. Given the title of this chapter, I'm sure that you can guess
how we handle queries. We use indexes to speed up queries. Using an index
turns a query from an O(N) operation into an O(log N) operation. For those of you
who don't care about abstract computer science stuff, the difference is between
waiting 30 minutes for a result and getting it right away.
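To put rough numbers on that: with ten million documents, an O(N) scan has to visit all 10,000,000 of them, while an O(log N) index lookup is on the order of log2(10,000,000), roughly 23 comparisons.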
We haven't created any indexes, but we can query!
Yes, that is confusing, isn't it? RavenDB doesn't allow queries without an index to answer the question. Yet at the same time, it does
support queries without first defining an index.
The answer is very simple. RavenDB is a pretty smart beast; whenever you make a query, the query optimizer gets a chance to look at
it, and it selects the appropriate index to use. But what happens
when we don't have such an index? Well, you obviously want to
query that information, otherwise you wouldn't have sent the database
this query. What to do?
The query optimizer at this point can figure out what you need to have
indexed, and it will create this index for you. The details of this process
are explained later in the chapter, but the key part to understand is
that RavenDB can automatically optimize itself to answer the kinds
of queries that you execute. This happens on the fly and without
requiring any human involvement. The more you use RavenDB, the
smarter it becomes and the faster it is in responding to requests.
But we are jumping ahead of ourselves here; we'll discuss the ad hoc querying
optimization RavenDB does in the dynamic indexing section. Before we get
there, let us look at a few standard RavenDB indexes first.
6.1 How indexing works in RavenDB?
An index is a way for the database to organize information about your data so
it will be able to retrieve said data efficiently. Let us look at an index definition in
RavenDB. In this case, we want to index the Name property of a Product, so
we can search for a product by name. Listing 6.1 shows a simple index that will
allow us to answer such a query.
Listing 6.1: Index definition for searching products by name
from product in docs.Products
select new
{
    product.Name
}
This doesn't look very much like an index, does it? It looks like a LINQ query. In
fact, if we were to execute this query we'll get the following results:
Alice Mutton                  Chocolade
Aniseed Syrup                 Côte de Blaye
Boston Crab Meat              Escargots de Bourgogne
Camembert Pierrot             Filo Mix
Carnarvon Tigers              Flotemysost
Chai                          Geitost
Chang                         Genen Shouyu
Chartreuse verte              Gnocchi di nonna Alice
Chef Anton's Cajun Seasoning  Gorgonzola Telino
Chef Anton's Gumbo Mix        Grandma's Boysenberry Spread

Table 6.1: First 20 product names, sorted.
So how can this be an index? The answer is that this isn't actually the index.
The LINQ expression above is actually the index definition; it determines what
will be indexed (as well as exactly how, but we'll touch on that later). How does
that work?
Let us look at Listing 6.1. This is the external representation of the index, but
internally, we also need to track where the details came from, so the end result
is:
from product in docs.Products
select new { product.Name, product.__document_id }
The output of the index definition is a list of objects with a Name and a
__document_id property. But what can we do with this?
The following isn't actually how this works in RavenDB; we'll get to
the full details of that in a bit. This is an attempt to explain how
RavenDB works by simplifying things as much as possible.
Now that we have the data to be indexed, we can actually put it in the index.
From a logical perspective, Listing 6.2 shows what is going on after we have the
index entries.
Listing 6.2: Logical view of storing index entries in the index
var index = new Dictionary<string, string>(); // name -> doc id
foreach (var indexEntry in results)
    index[indexEntry.Name] = indexEntry.__document_id;
And queries now become a simple matter of reading through the index and then
sending the results back, as shown in Listing 6.3.
Listing 6.3: Logical view of querying the index
// query for Name = "Chang"
var docId = index["Chang"];
return LoadDocument(docId);
Again, this isn't actually how it works, but it is a simple way of thinking about
it. Right now I want you to understand the general concept, rather than
the actual details.
The __document_id property
__document_id is a reserved property name in RavenDB; it
maps to the document id of the document, regardless of the client
side convention.
So, we use a LINQ expression to select the fields to index from our documents.
We also add the relevant document id to the output, and then we put all of
those details in an index. When we query, we use the index to figure out
the actual document ids that match the query, then we load the documents
from storage by id and send them to the client.
For the rest of this chapter (and in general), we'll use the following terminology:
Indexing function - the LINQ expression (such as the one in Listing 6.1)
used to project the fields to be indexed from the documents.
Index entry - the output of the indexing function. A single document
can output zero or more index entries.
Indexing run - the execution of all the indexes on a batch of documents.
6.2 Incremental indexing
By now, you have almost all the pieces you need to understand how RavenDB
indexing works. In Chapter 4, we discussed etags. An etag is just an ever increasing number that changes whenever a document changes. Because a RavenDB

function over the entire data set every time. Instead, we do incremental indexing, and we do that using the etag. Listing 6.4 shows a simplied version of how
indexing works.
Listing 6.4: Highly simplified indexing process
while (databaseIsRunning) {
    var lastDocEtag = GetLatestEtagForAllDocuments();
    var lastIndexEtag = GetLastIndexedEtagFor("Products/ByName");
    if (lastDocEtag == lastIndexEtag) {
        // index is up to date
        WaitForDocumentsToChange();
        continue;
    }
    var docsToIndex = LoadDocumentsAfter(lastIndexEtag)
        .Take(autoTuner.BatchSize);
    foreach (var indexEntry in indexingFunc(docsToIndex)) {
        StoreInIndex(indexEntry);
    }
    SetLastIndexEtag("Products/ByName", docsToIndex.Last().Etag);
}
I'll repeat again that Listing 6.4 shows the conceptual model; the actual workings
of this are very different. But the overall process is similar in intention, if not in
practice.
Indexing works by pulling a batch of documents from the document storage,
applying the indexing function to them, and then writing the index entries to
the index. We update the last indexed etag, and repeat the whole process again.
When we run out of documents to index, we wait until a document is created
or changed, and the whole process starts anew.
This means that when we index a batch of documents, we just need to update
the index with their changes, no need to do all the work from scratch. You can
also see in the code that we are processing the documents in batches (which
are auto tuned for best performance). Even though the code in Listing 6.4 is
far away from how it actually works, it should give you a good idea of the
overall process.
6.3 The indexing process
In relational databases, indexes are computed as part of the same write transaction. That leads to an interesting tradeoff. If you don't have the right indexes,
you are falling back to a table scan, and all the performance degradation that
comes with that. So you want your indexes to cover all the columns you are
querying. But the more indexes you have, the slower writes become.
It ends up being up to the DBA, who needs to make a judgment call. And often
the DBA needs to make that judgment call without a lot of information. The
decision needs to be made upfront, before the DBA has any actual performance
numbers (because the application hasn't been created yet). With RavenDB, we
chose a different approach.
Instead of executing the indexes as part of the write transactions, we run them
as a background task. You can see a hint of that in Listing 6.4. This code (or
rather, its conceptual cousin) is always running in the background, and whenever
there is a change in the documents, the indexing function is run on the new or
updated documents.
This has some interesting implications. The most obvious one is that there is an
inherent race condition between updating a document and the indexes catching
up with that update. Usually, the time between a document being updated and
the relevant indexes being updated is measured in milliseconds, usually between
15 ms and 25 ms.
Staleness in the real world
At first sight, the idea of a query that may not be fully up to date
sounds scary. But in practice, this is how we almost always work
in the real world. The trivial counter example that I keep getting
thrown in my face is financial transactions.
"Of course you need full consistency for everything financial,"
they tell me. Try this: call your bank and ask them how much money
you have in your account. The answer you'll get is going to be some
variant of: "As of last business day, you had"
Or let us take a pure software example: if you create a new bug, does
it matter if your manager sees it right away, or waits an additional
25 ms to get it? This design choice was made explicitly, and it has
quite a lot of benefits to it (detailed below).
One very important factor to note is that RavenDB will always tell
you if you are getting potentially stale results, but more on that
later.
The other implication of this design decision is that the kind of promises
the RavenDB database engine gives you is drastically different from the kind of promises
a relational database engine will give. A relational database will promise you
that you'll have a fully consistent view of the universe. The problem with that
is that it is very costly to actually do this. That is why relational databases
have the concept of transaction isolation levels, and why very few of them opt to
default to a high isolation level.
It is just too costly to do so. In order to keep its promise, the relational engine
has to take locks, and do a lot of extra work to isolate different transactions
from each other. The more transactions you have, the higher the cost. Until at
a certain point, the relational database is overloaded, and it will throw its hands
up in the air and go sit in the corner while it is having a funk.
Your application, in the meantime, will start erroring (if you are lucky) or just
hang, waiting for the relational database to respond.
With RavenDB, you get a different kind of promise. RavenDB will promise to:
Give you immediate results based on what we currently have in the index.
If the current state of the index isn't up to date, RavenDB will tell you
so, including how up to date the index is.
RavenDB will do its best to reduce the indexing latency.
You have the option to explicitly wait for the results to become non stale
(sketched right below).
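A sketch of that last option, using the query customization from the client API (the query itself is illustrative):
var products = session.Query<Product>()
    .Customize(x => x.WaitForNonStaleResultsAsOfNow())
    .Where(p => p.Name == "Chang")
    .ToList();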
Documents are always consistent
RavenDB uses the terms stale and non-stale to refer to out of date
indexes. This is done intentionally, because consistency is always
maintained.
Listing 6.3 showed how queries work, from a logical perspective. We
first consult the index to find a match for the query we are executing.
Once we have a match, we have the relevant document id. Using
that document id, we go to the document storage and load the
document by id.
RavenDB's document storage subsystem is fully consistent (for the
database nerds, which I assume you are if you are reading this book:
it implements Snapshot Isolation and optimistic write locks), so
you are always getting the latest committed version of the document.
Why is it important that RavenDB has a different set of promises than a relational database?
Because by making a different set of promises, we have opened ourselves to a
great deal of optimization opportunities. Here are just a few of those that are
implemented in RavenDB.
6.3.1 Throughput vs. Latency
I/O is usually the most expensive part of indexing. And it doesn't really
matter if we are indexing a single document or a hundred; the base I/O cost of
an indexing run is high enough to overshadow all other costs. Because of that,
RavenDB utilizes several strategies when it comes to indexing.



Under light load, RavenDB uses a low latency / low throughput strategy. Documents will be indexed as soon as possible, and while the I/O cost of that can
be high, it doesn't matter, because we don't have enough load to saturate
the system. But when the load gets higher, RavenDB automatically switches
to a different strategy: the high latency / high throughput strategy. Under
those conditions, we are going to run fewer indexing runs, but each indexing run
will have more documents that need to be updated. This allows us to amortize
the I/O cost of indexing over a large set of documents, resulting in greater
efficiency all around.
What is with all the implementation details?
You don't really need to know all those details. And even though
I'm going over the details, I'm still leaving a lot of stuff out. We
are constantly tweaking this part of RavenDB in an attempt to achieve
higher performance and better throughput, so the details are subject
to change.
But it is good to have a proper understanding of at least the kind of
challenges that we have to go through, and the overall strategy to
handle them. It can help you understand why RavenDB behaves in
a certain way in specific situations.
During indexing, we have to balance a fair number of variables to try to get to
an optimal result. Indexing uses a lot of memory: we need to hold the source
documents to index, the actual output from the indexing function, the generated
index itself, etc. Indexing is also a compute intensive operation, especially
if you are using spatial indexing or require suggestions support. Indexing uses a
lot of I/O, because it needs to read the source documents and write the index
entries to disk.
Indexing scenarios
There are two common scenarios that pop up when talking about
indexing: the new index mode and the steady state mode. In the
new index mode, we have a new index defined, which needs to index
the entire data set. With large data sets, that can take a while.
The steady state mode is the usual way RavenDB runs, when most of
the data is already indexed, and we only have to work on indexing
the new/updated documents as they are written to the database.
There are different optimizations for each scenario, and the interesting thing happens when we are actually creating a new index at the
same time as we have existing indexes and a steady rate of writes to
the database.
In other words, this is a typical production setting, especially since we can
expect indexes to be introduced to the system on a fairly regular basis
in production (by the query optimizer, for example). Unlike many
relational databases, introducing a new index to a production system
results in no locks, and it is perfectly fine to do so. There is a cost
in the actual indexing, but RavenDB knows how to manage that to
avoid using too much.
The level of trust we have in this system is so high that we let an
automated component make indexing decisions on the fly, to self-optimize your database.
The indexing code in RavenDB needs to balance competing requirements between
all those resources, and at the same time make sure that we aren't overwhelming
the machine we are running on. The other side of that is that if we have a lot
of resources on the machine (many cores, lots of memory, SSD drives), we want
to make the best use of those resources that we can. Let me see if I can explain
how we try to maximize resource utilization without using too much.

6.3.2 Parallel indexing


Let us assume that you have a database with a million documents in it, and
you just created ten new indexes. The server machine has 4 cores and 8 GB of
memory. What would happen now? Each of the indexes needs to go over the
entire data set. One way to handle this is to spin up a thread for each index,
point it at the documents, and let it run. This has a lot of issues. To start with,
each of the indexes is going to need the same source data (the documents),
so just running each independently will result in a lot of duplicated I/O and
memory usage. It also turns out to be problematic because we might actually
generate so much CPU work that we won't have the time to process requests,
if we have ten threads doing compute intensive work.
The actual indexing process in RavenDB is split into runs. On each run, we load
a batch of documents, then select a number of indexes² and let them run on
that batch. When those indexes are done, we select the next set of indexes
to run, and so on until we are done with all the indexes that are relevant for
this batch.
Parallelism at the single index level
RavenDB also parallelizes work at the single index level. A single
index that needs to process a large number of documents will also execute the indexing work in parallel. The idea is to maximize parallelism
at both the single index level and among all indexes.
² The exact number depends on your configuration and license, but on the system we specified, it will usually be 4 - 8.



If there are still documents to be indexed, we'll then fetch the next batch of
indexes, and go through the process again. The reasoning behind limiting the
number of concurrent index executions is that each index is going to be doing
a lot of compute intensive work (indexing) as well as using up memory. By limiting the number of indexes that execute concurrently, we can reduce the overall
indexing cost.
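As a rough illustration of a single run (with invented names; this is not the actual RavenDB code), the key point is that every selected index sees the same in-memory batch, and the degree of parallelism is capped:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Illustrative sketch: the batch is read once and shared, and only a
// capped number of indexes run concurrently so request processing
// isn't starved of CPU.
static void RunIndexingBatch(
    IReadOnlyList<object> batch,
    IEnumerable<Action<IReadOnlyList<object>>> indexes,
    int maxConcurrentIndexes)
{
    Parallel.ForEach(
        indexes,
        new ParallelOptions { MaxDegreeOfParallelism = maxConcurrentIndexes },
        index => index(batch));
}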

6.3.3 Auto tuning batches


We mentioned batches a few times, but let us talk about what those batches
actually are. At the simplest level, a batch is just a collection of documents:
"give me the next 128 documents after this etag, in etag order". Because
we feed batches into the indexing process, the actual size of the batch is quite
important. The bigger the batch, the more work the indexes will do, and the
longer they will take to run. But because there are fixed I/O costs per batch,
the bigger the batch, the less time we take per document, up to a point.
A question comes up then: how do we decide what the batch size should be? This
seems to be a typical configuration value that we'll let the administrator decide.
But that is a pretty sensitive option with regards to system performance,
and we don't want to have the administrator on call to start modifying this configuration value in production if things change. That is why RavenDB doesn't
have a configuration value for this property. Instead, it has a configuration
range (min and max values). The RavenDB engine will determine the optimal
batch size within this range automatically.
When there isn't a lot of work to do, the batch sizes are going to be small,
favoring low latency (and low throughput). But when we are facing a high
write situation, or when we have to index a lot of data (such as when we have
a new index on a large database), RavenDB will note that the small batch sizes
aren't enough to process the data quickly and start increasing the batch size.
This is controlled by many factors (current indexing time and available memory,
among many) to avoid creating a batch size that is too large, but in practice
this works very well.
When we have a large number of items to index, we will slowly move to a larger
batch size and process more items per batch. This is the high throughput (but
high latency) indexing strategy that we talked about. When the load goes down,
we'll decrease the batch size to its initial level. This seemingly simple change has
dramatic implications on the way RavenDB can handle spikes in traffic. We'll
automatically (and transparently) adjust to the new system conditions. That
is much nicer than waking your administrators at 3 AM.
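The general shape of that logic, as a simplified sketch (the names and the doubling/halving policy are invented for illustration; the real engine weighs indexing time, available memory and more):

using System;

// Illustrative sketch: bounded auto-tuning of the batch size between
// the configured minimum and maximum.
public class BatchSizeTuner
{
    private readonly int min;
    private readonly int max;
    private int current;

    public BatchSizeTuner(int minBatchSize, int maxBatchSize)
    {
        min = minBatchSize;
        max = maxBatchSize;
        current = minBatchSize; // start small, favoring low latency
    }

    public int NextBatchSize(int documentsStillPending, bool enoughMemory)
    {
        if (documentsStillPending > current && enoughMemory)
            current = Math.Min(current * 2, max); // ramp up for throughput
        else
            current = Math.Max(current / 2, min); // fall back toward low latency
        return current;
    }
}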


6.3.4 What gets indexed?


This is an interesting question, and it often trips people up. Let us assume that
we have the two indexes shown in Listing 6.5.
Listing 6.5: A couple of index definitions

// users/ByNameAndEmail
from user in docs.Users
select new { user.Name, user.Email }

// orders/ByOrderIdAndCustomerEmail
from order in docs.Orders
select new { order.OrderId, order.Customer.Email }
The first index indexes users by name and email, and the second allows us to query
orders by the order id or the customer's email. What would happen when we
create a new Order document?
Well, we need to execute all the relevant indexes on it, and at first glance it
will appear that we only need to index this document using the second index.
The problem is that we don't know this yet. More to the point, we don't have
a good way of knowing that. We determine if an index needs to run or
not by checking its last indexed etag and comparing that to the latest document
etag. That information doesn't take into account the details of which collection
the document belongs to.
Because of that, all indexes need to process all documents, if only to verify that
they don't care for them. At least, that is how it works in theory.
Remember, this is a deep implementation details discussion. Please
understand that the following details are just that, implementation
details, and are subject to change in the future.
In practice, we take quite a bit of advantage of the nature of RavenDB to
optimize how this works. We still need to read the documents, but we can
figure out ahead of time that a certain document isn't a match for a specific
index, so we can avoid even giving the document to the index. That means that
we can just update the last indexed etag of that index past any document that
doesn't match the collections that this index operates on.
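A sketch of that bookkeeping (all names are invented; this is not the actual RavenDB code):

using System.Collections.Generic;
using System.Linq;

// Illustrative sketch: documents from other collections are never
// handed to the map function; we only advance the index's cursor
// (its last indexed etag) past them.
class Doc { public long Etag; public string Collection; }

class IndexState
{
    public HashSet<string> ForEntityNames = new HashSet<string>();
    public long LastIndexedEtag;

    public void IndexBatch(IEnumerable<Doc> batch)
    {
        foreach (var doc in batch.OrderBy(d => d.Etag))
        {
            if (ForEntityNames.Contains(doc.Collection))
                Map(doc); // only matching documents reach the index
            LastIndexedEtag = doc.Etag; // but the cursor always advances
        }
    }

    void Map(Doc doc) { /* run the index's map function */ }
}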
That skipping works quite efficiently in reducing the amount of work required when we
are indexing only new and updated documents, but it gets more complex when
we need to deal with a new index creation.

6.3.5 Introducing a new index


I mentioned already that in RavenDB, the process of adding a new index is both
expected and streamlined. While this is indeed the case, it does not mean that
it is a simple process. A new index requires that we index all the previous
documents in the database. And we don't know, at this time, what collections
they actually belong to. So if we have ten million documents in the database
and we introduce a new index, we'll need to read them all, send the matching
documents to the index, and discard the rest. That is the case even if your
index only covers a collection that has a mere hundred documents.
As you can imagine, this is quite expensive, and not something that we are
very willing to do. Because of that, there are several optimizations that are
applied during this process. The first of these is to use the system index
Raven/DocumentsByEntityName to find out exactly how many documents we
have to cover. If the number is small (under 131,072 documents, by default),
we'll just load all the relevant documents and index them on the spot. This gives
us a great advantage when creating new indexes on small collections, because
we can catch up very quickly.
What happens when you have multiple new indexes?
So far, we have talked about the scenario where we have just a single new
index, but it is pretty common to have multiple new indexes created
at roughly the same time.
This can happen during a new deployment, when you have created several
new indexes. RavenDB is aware of this, and it will consider indexes
that are roughly in the same position with regards to indexing to be
part of the same group.
In particular, the way it is structured, creating multiple new indexes
in a short amount of time will tend to group them into their own
group and index all of them together.
However, that doesn't really help us in the case of bigger collections. What
happens then? At this point, one of two strategies comes into play, depending
on the exact situation involved. If there aren't enough resources, the database
will split the work between the new index and the existing indexes. So the new
index will get a chance to run and index a batch of documents, then the existing
indexes will have a chance to run over any new documents that came in, and so on.
RavenDB is biased in this situation toward the existing indexes, because we
don't want to stall them too much. That might mean that the new index will
take time to build completely, but that isn't generally an issue; it is a new index,
and expected to take some time.
If there is a wealth of resources to exploit, however, RavenDB will choose a
different strategy. It will create a dedicated indexing task that will run in the
background, separate from the normal indexing process and in parallel to it.
This indexing task will try to get as many documents indexed for the new
index as it possibly can, as fast as it can. This generally requires more resources
(memory, CPU & I/O), so even though this is the preferred strategy, we can't
always apply it.

At any rate, introducing a new index is by now a well-oiled process, even on
large databases; it is a safe enough process that we let an automated system
decide when we need a new index.

6.3.6 I/O Considerations


I/O is by far the most costly part of the indexing effort, and the hardest to
optimize. This is because RavenDB can run on anything from systems that have persistent
RAM disks (so reading from disk is effectively free) to virtual disks whose data
is actually fetched over the network (so I/O latency is very high). Note that this
applies to both reads and writes. Indexing needs to read a (potentially large)
amount of documents so it can actually index them, and it needs to write a (much
smaller) amount of data to the index.
In order to best utilize the system resources, we have a component called the
prefetcher queue³ that is responsible for supplying the indexes with the next
batch. We do that in several ways. In the steady state mode, whenever a
new document is added or updated, we also register it in the prefetchers. That
means that normally, indexing doesn't need to hit the disk at all; it can index
completely from memory.
For indexes that need to access data that is already on disk, we apply a different
optimization. Whenever the prefetcher is done handing out a batch, it starts an
asynchronous process to load the next batch into memory. That way, when the
indexes are done and want the next batch, it is already ready for them and can
be handed out immediately.
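Roughly, the load-ahead loop looks like this (a sketch with invented names; the real prefetcher also tracks memory and I/O budgets):

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Illustrative sketch: hand out the already-loaded batch, and start
// loading the one after it in the background right away.
class Doc { public long Etag; }

class Prefetcher
{
    private readonly Func<long, List<Doc>> loadBatchAfter;
    private Task<List<Doc>> pending;

    public Prefetcher(Func<long, List<Doc>> loadBatchAfter)
    {
        this.loadBatchAfter = loadBatchAfter;
        pending = Task.Run(() => loadBatchAfter(0));
    }

    public List<Doc> GetNextBatch()
    {
        var batch = pending.Result; // usually already in memory
        if (batch.Count > 0)
        {
            var lastEtag = batch[batch.Count - 1].Etag;
            pending = Task.Run(() => loadBatchAfter(lastEtag)); // load ahead
        }
        return batch;
    }
}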
The actual behavior is pretty complex, because we don't want to load so much
data that we won't have enough memory to do the actual indexing, and document
sizes aren't fixed. We also need to take into account the I/O speeds that we
can get, as well as the document sizes. Because the batch size can change, it is
possible to get a batch size that is large enough that it takes so long to fetch
that the optimization is useless. This is especially the case with virtual network
hardware (where high latency is the rule, rather than the exception).
Because of that, when loading data from the disk we are actually limiting ourselves on multiple fronts: the count of indexes, the total size read and the time to read
from disk. If we hit any of those limits, we immediately stop the reading
process and return with whatever data we already have. Otherwise, we might
be stuck waiting for all the records to load, while we could use the time to also
do indexing.
Because the prefetcher will fetch the next batch in the background, it is more
efficient to hand a smaller batch over for indexing while we are fetching
the next batch. Otherwise, we'll spend a lot of time in I/O without using the
CPU resources for indexing.
³ This is used by the indexing and the replication processes.



The other side of I/O is writing the data. Usually, we write a lot less data than
we read for indexing, so that is a far less troublesome area. But here, too, we
have applied optimizations. When writing to the index, we always write directly
to memory first and foremost. And at the end of the index run, we'll not be
writing those changes to disk. Going to disk is expensive, so we're trying to
avoid it if possible.
So, when do we write to disk? When one of the following happens:
- The amount of memory used crosses a certain threshold. Currently this is 5 MB⁴, so after the index in memory hits that size, we'll flush it to disk.
- There are no more documents to index and there is nothing else to do.
- A certain time threshold has passed without flushing to disk.
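Put together, the flush decision looks roughly like this sketch (names invented; the 5 MB size is the configurable default mentioned above, and the time threshold value here is just an assumption):

using System;

static bool ShouldFlushToDisk(
    long inMemoryBytes, bool moreDocumentsPending, TimeSpan sinceLastFlush)
{
    const long maxInMemoryBytes = 5 * 1024 * 1024; // ~5 MB default
    return inMemoryBytes > maxInMemoryBytes   // memory threshold crossed
        || !moreDocumentsPending              // nothing else to do
        || sinceLastFlush > TimeSpan.FromMinutes(5); // assumed time threshold
}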
This allows us to only go to the disk for writes when we really have to. In the
meantime, we are still able to give you full access to the indexed results directly
from memory. However, that does raise an interesting question: if we store
things in memory, what happens in the presence of failure?

6.4 Index safeties


RavenDB is an ACID database. That means that when putting data into RavenDB,
you are ensured that the only way to lose this data is if you physically take a
hammer to the disk drive. The same isn't quite true for indexes. Indexes are
updated in the background, and we do a lot of work to ensure that we give you
both fast indexing times and fast query times. That means that a lot of the
time, we operate on in-memory data.
In other words, as soon as there is a failure, all this data goes away. Well, not
really. Remember, the data is only in memory up to a point, at which point it
gets saved to disk. So at worst, if we have a hard crash, we lose some indexing
data, but we check this on startup, and that only means that we'll have to
re-index the last few documents that haven't been persisted to disk yet. So we
are good. Or are we?
RavenDB uses Lucene⁵ as its indexing format, and that gives us a lot of power,
because Lucene is a very powerful library. Unfortunately, it is anything but
simple to work with operationally. I'll touch on that in the next section, but
the important fact is that Lucene doesn't guarantee that its data will be safely
flushed to disk, even if it actually does write to disk⁶.
⁴ This is configurable via the Raven/Indexing/FlushIndexToDiskSizeInMb setting.
⁵ To be rather more exact, we use Lucene.NET.
⁶ For the database nerds, the difference is the lack of a call to fsync() or its moral equivalent when finishing writing. A crash can still cause the data written to the file to be lost.

RavenDB takes a proactive approach to handling that. On startup, we ensure
that the index is healthy, and if needed, we'll reset it to a previous point (or
entirely) to make sure that we don't lose data from the index. This is usually
only required after a hard machine failure, though. We have run RavenDB
through many simulations to make sure that this is the case. In one particular
test case, we managed to find a bug after 80 consecutive hard crashes (pulling
the power cord from the machine)⁷.
In short, documents stored in RavenDB are guaranteed to always be there,
even if you start pulling power cords and crashing systems. Index entries in the
index don't have this promise, but we'll ensure that we fill in any missing pieces
(by simply re-running the index again over the source data) if something really
bad happened. We keep to our respective promises on each side: documents
are safe and consistent; indexes are potentially stale and eventually consistent.

6.5 Lucene
I mentioned earlier that we are using Lucene to store our indexes. But what is
it? Lucene is a library for creating an inverted index. It is mostly used for full
text searching and is the backbone of many search systems that you routinely
use. For example, Twitter and Facebook are using Lucene, and so does pretty
much everyone else. It has gotten to the point that other products in the same area
always compare themselves to Lucene.
Now, Lucene has a pretty bad reputation⁸, but it is the de facto industry standard for searching. So it isn't surprising that RavenDB is using it, and doing
quite well with it. We'll get to the details about how to use Lucene's capabilities
in RavenDB in the next chapter; for now I would like to talk about how we are
actually using Lucene in RavenDB.
I mentioned that successfully running Lucene in production is somewhat of a
hassle for operations. This has to do with several reasons:
- Lucene needs to occasionally compact its files (a process called merge). Controlling how and when this is done is key for achieving good performance when you have a lot of indexing activity.
- Lucene doesn't do any sort of verifiable writes. If the machine crashes midway through, you are open to index corruption.
- Lucene doesn't have any facility for an online backup process.
- Optimal indexing and querying speeds depend a lot on the options you use and the exact process in which you work.
⁷ I'll also take this opportunity to thank Tobias Grimm, who was a great help in finding those kinds of issues.
⁸ It is fairly involved to run, from an operations point of view.



All of that requires quite a bit of expertise. We've talked about how RavenDB
achieves safety with indexes in the previous section. The other issues are also
handled for you by RavenDB. I know that the previous list can make Lucene
look scary, but I think that Lucene is a great library, and it is a great solution
for handling search.

6.6 Transformations on the indexing function


So far, we have gone through the details of how RavenDB indexes work (transforming a set of input documents into Lucene index entries), and we spent a
considerable amount of time diving into the details behind the actual indexing
process itself. Now I want to focus primarily on what RavenDB does with the index
definitions you create. Let us look at Listing 6.6 and Listing 6.7, which show a
simple index definition and what RavenDB does with it (respectively).
Listing 6.6: A simple index definition

// users/ByNameAndEmail
from user in docs.Users
select new { user.Name, user.Email }
What would happen when we create such an index in RavenDB? The RavenDB
engine will transform the simple index definition in Listing 6.6 into the class
shown in Listing 6.7. You can look at that and see the actual indexing work
that is being done by RavenDB⁹.
Listing 6.7: The generated index class in RavenDB

public class Index_users_ByNameAndEmail :
    Raven.Database.Linq.AbstractViewGenerator
{
    public Index_users_ByNameAndEmail()
    {
        this.ViewText = @"
from user in docs.Users
select new { user.Name, user.Email }";
        this.ForEntityNames.Add("Users");
        this.AddMapDefinition(docs =>
            from user in ((IEnumerable<dynamic>)docs)
            where string.Equals(
                user["@metadata"]["Raven-Entity-Name"], "Users",
                System.StringComparison.InvariantCultureIgnoreCase)
            select new {
                user.Name,
                user.Email,
                __document_id = user.__document_id
            });

        this.AddField("__document_id");
        this.AddField("Name");
        this.AddField("Email");
        this.AddQueryParameterForMap("__document_id");
        this.AddQueryParameterForMap("Name");
        this.AddQueryParameterForMap("Email");
        this.AddQueryParameterForReduce("__document_id");
        this.AddQueryParameterForReduce("Name");
        this.AddQueryParameterForReduce("Email");
    }
}

⁹ You can always get the source for each index definition by going to the following URL: http://localhost:8080/databases/Northwind/users/ByNameAndEmail?source=yes (naturally, you'll need to change the host, database and index names according to your own system).
All indexes inherit from the AbstractViewGenerator class, and the actual indexing work is done in the lambda passed to the AddMapDefinition call. You
can see how we changed the index definition: the docs.Users call, which looks
like a collection reference, was changed to the more accurate where statement,
which filters unwanted items from different collections. You can also see that
in addition to the properties that we have for indexing, we also include the
__document_id property.
Note that we keep the original index definition in the ViewText property (mostly
for debug purposes), and that we keep track of the entity names each index covers. The latter is very important for optimizations, since we can make decisions
based on this information (such as which documents we need not send to this index).
The RavenDB Indexing Language
On the surface, it looks like the indexing language RavenDB is using
is C# LINQ expressions. And that is true, up to a point. In practice,
we have taken the C# language prisoner, and made it jump through
many hoops to make our indexing story easy and seamless.
The result isn't actually plain C#. For example, there are no nulls.
Instead, we use the Null Object pattern to avoid dealing with NullReferenceExceptions. Another change is that the language we use
isn't strongly typed, and won't error on missing members.
All of that said, you can actually debug a RavenDB index in Visual
Studio, because however much we twist the language's arm, we end
up compiling to C#.



The rest of the calls (AddField, AddQueryParameterForMap, AddQueryParameterForReduce)
are mostly there for bookkeeping purposes, and are used by the query optimizer
to decide if an index should get to handle a specific query.

6.7 Error handling


We try very hard to ensure that an index can't actually generate errors, but in
the real world, that isn't actually an attainable goal. So the question becomes:
what is going to happen when an index runs into an error? Those can be divided
into several parts.
The indexing function can run into trouble. The easiest way to reproduce that
is to have a division in the indexing function, and have the denominator set
to zero. That will obviously cause a DivideByZeroException. What happens then?
The indexing process will terminate the indexing of the document that caused
this issue, and an error will be logged. You can see the indexing errors in the
studio and in the index statistics.
Along with the actual error message, you'll have the faulting index and the
document that caused all that trouble. In general, errors in indexes aren't a
problem, but because an error stops a document from being indexed (only by the
index that actually caused the error, mind) it can be hard to understand why.
A query on the index won't fail if some documents failed to be indexed; you
need to explicitly check the stats page (or the database statistics) to see the
actual error, as the sketch below shows.
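A minimal sketch of checking those errors from the client, assuming the 3.0 client's GetStatistics command (the exact property names on each error entry are an assumption here; check your client version):

// Assumes an initialized DocumentStore named 'store'.
var stats = store.DatabaseCommands.GetStatistics();
Console.WriteLine("Indexing errors: " + stats.Errors.Length);
foreach (var error in stats.Errors)
{
    // Each entry carries the faulting index, the offending document
    // and the error message (property names assumed).
    Console.WriteLine(error.Document + ": " + error.Error);
}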
If a large percentage of the documents are erroring (over 15%, once we are past
some initial number of documents), however, we'll declare that index faulty.
At that point, it will not be active any longer and won't take part in indexing.
Any queries made to the index will result in an exception being thrown. You'll
need to fix the index definition so it won't throw so many errors for it to resume
standard operation.
Another type of indexing error relates to actual problems in indexing. For
example, the indexing disk might be full. This will cause the indexing process
to fail, although that wouldn't count against the 15% fail quota for the index.
You'll be able to see the warning about those failures in the log.

6.8 What about deletes?


So far, we have talked a lot about how indexing works for new or updated documents.
But how do the indexes work when we delete documents? The short answer
is that this is a much more complicated process, and it was quite annoying to
have to deal with it. The basic process goes like this: whenever a document
is deleted, we check all the indexes for those that would cover this particular
document. We then generate a RemoveFromIndexTask background task for
each of those indexes. We save that background task in the same transaction as
the document deletion. The indexing process will check for any pending tasks
as part of its work, and it will load and execute those tasks.
In this case, the work this task will do is to remove the relevant documents
from the indexes in question. The process is quite optimized, and we'll merge
similar tasks into a single execution, to reduce the overall cost. That said, mass
deletion, in particular, is a costly operation in RavenDB.
Note that as soon as the document is actually deleted from the document
store, we won't be returning it from any queries; the purpose of the
RemoveFromIndexTask is to clean up the indexes more than anything else.

6.9 Indexing priorities


Users frequently request better ways to control the indexing process priorities:
to decide that a particular index is very important, and should be given
first priority above all other indexes. While this seems to be a reasonable
request, it does open up a lot of very complex issues. In particular, how do you
prevent starvation of the other indexes, if the very important index is running
all the time?
So instead of implementing a ThisIsVeryImportantIndex flag, we switched things
around and allow you to indicate that a particular index isn't that important.
The following indexing priorities are supported:
- Normal - The default; execute this index as fast as possible.
- Idle - Only execute this index when there is no other work to be done.
- Abandoned - Only execute this index when there is no other work to be done, there hasn't been any work for a while, and it has been a long time since we last ran this index.
- Disabled - Don't index at all.
- Error - This index has too many errors and has failed.
Idle indexing will happen if RavenDB doesn't have anything else to do. Abandoned is
very similar to Idle, but it won't trigger even if we have nothing to do right now; it will
only trigger if we haven't had anything to do for a long while. The expectation
is that abandoned indexes will run whenever you have a long idle period, for
example at night or over the weekends.
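As a hedged sketch, here is how you would mark an index as Idle from the client, assuming the SetIndexPriority command and IndexingPriority enum of the 3.0 client (the index name is just an example):

// Assumes an initialized DocumentStore named 'store'. The command
// and enum names are assumptions based on the 3.0 client API.
store.DatabaseCommands.SetIndexPriority(
    "Orders/ByOrderIdAndCustomerEmail", IndexingPriority.Idle);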
Why not have a Priority level?
Any priority scheme has to deal with starvation issues. And while it
seems like a technical detail, there is a big difference in expectations between having one index set to idle and another set to normal,
and having one index set to normal and the other set to priority.



In the first case, it is easy to understand that the idle index won't run
as long as the normal index has work to do. In the second case,
you probably want both to run, but the priority index to run more often
or with higher frequency. The problem with starvation prevention
is that you have to punish the important index at some point, by
blocking it and running the other indexes. At that point, they have
a lot of work to do, so they can take a long time to run, and you
have defeated the whole point of priorities.
It might be a semantic difference, but I feel that this way clearly
states what is going to happen, and reduces surprises down the road.
Note that the query optimizer will play with those options for the auto indexes
that it creates, but it won't interfere with indexes that were created explicitly
by the user.

6.10 Summary
We've gone over the details of how RavenDB indexing actually works, hopefully
not in mind-numbing detail. Those details are not important during your
work with RavenDB; all of that is very well hidden under the covers. But it is
good to know how RavenDB will respond to changing conditions.
We started by talking about the logical view of indexing in RavenDB: how an
indexing function outputs index entries that will be stored in a Lucene index,
and how queries will go against that index to find the matching document ids,
which will then be pulled from document storage. We then talked about incremental
indexing and the conceptual process by which RavenDB actually indexes documents.
From the conceptual level, we moved to the actual implementation details, including the set of tradeoffs that we have to make in indexing between I/O, CPU
and memory usage. We looked at how we deal with each of those issues: optimizing I/O by prefetching documents and batching writes, optimizing memory
by auto tuning the batch size, and optimizing CPU usage by parallelizing work
(but not too much).
We also talked about what actually gets indexed, and how we optimize things
so an index doesn't have to go through all documents, only those relevant to
it. Then we talked about the new index creation strategies, and how we try to make
sure that this is as efficient as possible while still letting the system operate
normally.
We also talked a bit about Lucene: how we actually manage the index,
safeguard it from corruption and handle recovery, in particular by managing the
state of the index outside of Lucene, and by checking the recovered state
in case of a crash.


We concluded the chapter by talking about the actual code that gets run as
part of the index, error handling and recovery during indexing, and the details
of index priorities and why they are set up the way they are.
I hope that this peek behind the curtain doesn't make you lose any faith in the
magical properties of RavenDB; "pay no attention to the man behind the screen",
as the Wizard said. Even after knowing how everything works, it still seems
magical to me. And one of the most magical features in RavenDB is the topic
of the next chapter: how RavenDB allows ad-hoc queries by using automatic
indexing and the query optimizer.


Chapter 7

The query optimizer and dynamic queries
We started the previous chapter by talking about the following query, and how
it doesn't work. In this chapter, we'll spend our time talking about the entire
process that makes it possible to run such queries.

var recentOrdersQuery =
    from order in session.Query<Order>()
    where order.Company == "companies/1"
    select order;
var recentOrders = recentOrdersQuery.Take(3).ToList();
In RavenDB terms, the recent orders query is actually a dynamic query. We
don't explicitly specify what index we want to use; we just run the query
as if we had no care in the world, and let RavenDB take care of all the details.
But how is this actually done?
The RavenDB Client API translates the above query into the following REST
call:

GET /databases/Northwind/indexes/dynamic/Orders?&query=Company:companies/1&pageSize=3


If we break the call into its component parts, we'll have:
- /databases/Northwind - Using the Northwind database.
- /indexes/dynamic/Orders - Making a dynamic query on the Orders collection.
- ?&query=Company:companies/1 - Where the Company property contains the companies/1 value.
- &pageSize=3 - Get the first 3 results.
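For reference, you can make the same call from the client without LINQ by going through the lower-level commands API; this is a sketch assuming the DatabaseCommands.Query signature of the 3.0 client:

// Assumes an initialized DocumentStore named 'store'.
var results = store.DatabaseCommands.Query(
    "dynamic/Orders",
    new IndexQuery
    {
        Query = "Company:companies/1", // same Lucene-style query string
        PageSize = 3
    });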
A dynamic query like that is going to be sent to the query optimizer for processing.

7.1 The Query Optimizer


The query optimizer is responsible for handling dynamic queries. The first part
of its duties is to actually dispatch the query to the appropriate index. Listing
7.1 shows us how we can figure out what it actually did.
Listing 7.1: Using the query statistics to find the index used for the query

RavenQueryStatistics stats;
IRavenQueryable<Order> recentOrdersQuery =
    from order in session.Query<Order>()
        .Statistics(out stats)
    where order.Company == "companies/1"
    select order;
List<Order> recentOrders = recentOrdersQuery.Take(3).ToList();
Console.WriteLine("Index used was: " + stats.IndexName);
When we run the code in Listing 7.1, we will see that the index used was
Orders/Totals, one of the sample indexes in the Northwind database. How
did this happen? We certainly didn't specify that ourselves.
What happened was that the query optimizer got this query, and then it went
over the indexes, looking for all the indexes that index the Orders collection and
output the Company property to the index. When it found such an index, it
threw a party and then executed the query on that index.
The query statistics
The RavenQueryStatistics and the .Statistics(out stats) call provide
a wealth of information about the just executed query. Among the
details you can get from the query statistics you have:
- Whether the index was stale or not.
- The duration of the query on the server side.
- The total number of results (regardless of paging).
- The name of the index that this query ran against.
- The last document etag indexed by the index.
- The timestamp of the last document indexed by the index.

Figure 7.1: The RavenDB query optimizer likes to chase down queries and send them to the right indexes.


In addition to that, you can use the .ShowTiming() to get additional
detailed information about the execution time of the query on the
server.
But what would happen if we executed a query that had no index covering it?
For example, what would happen if we ran the code in Listing 7.2?
Listing 7.2: No index exists to cover this query

RavenQueryStatistics stats;
IRavenQueryable<Order> recentOrdersQuery =
    from order in session.Query<Order>()
        .Statistics(out stats)
    where order.Company == "companies/1"
        && order.ShipTo.Country == "Italy"
    select order;
List<Order> recentOrders = recentOrdersQuery.Take(3).ToList();
Console.WriteLine("Index used was: " + stats.IndexName);
Note that the only change between Listing 7.1 and Listing 7.2 is the addition
of && order.ShipTo.Country == "Italy" to the query. But because we have
this additional property, we can't use any existing index. What will the query
optimizer do?
Well, executing this code tells us that the index used is named Auto/Orders/ByCompanyAndShipTo_Country. And if we look at the studio, it is defined
as:

from doc in docs.Orders
select new
{
    ShipTo_Country = doc.ShipTo.Country,
    Company = doc.Company
}
What just happened? We didn't have this index just a minute ago!
What happened was that the query optimizer got involved. It got the query,
which required us to have an index on Orders that indexed both the Company
property and the ShipTo.Country property. But there was no such index in
existence. At that point, the query optimizer got depressed, tried to drink
away its troubles and considered a vacation in Maui when that failed. Coming
back from vacation, tanned and much happier, the query optimizer got down to
business. We have a query that we have no index for, and RavenDB does not
allow the easy and nasty solution of just iterating over all the documents in the
database, also known as table scans, also known as 3 AM wakeup calls.


So the query optimizer decided to create such an index. And hence, an index
was born. The proud parent watched over the new index, ensuring that it did
its job properly, and finally released it to the wild, to roam free and answer any
queries that would be directed its way.
Ad hoc queries weren't supposed to be there
When RavenDB was just starting (late 2010), we already had a user
base and a really cool database. What we didn't have was ad hoc
queries. If you wanted to query the database, you had to write an
index to do so, and then you had to explicitly specify which index
would be used in each query. That was a hassle, but there was really
no other way around it. We didn't want to do anything that
would force a table scan, and there was no other way to support this
feature.
Then Rob Ashton popped into the mailing list and started sending us crazy
bug reports with very complex map/reduce indexes¹. And he started
making silly proposals about dynamic queries, stuff that would obviously not work.
The end result was that I asked him to show me some code, with
the full expectation that I would never hear from him again.
He came back a day later with a functional proof of concept. After
I managed to pick my jaw off the floor, I was able to figure out what
he was doing and got very excited².
Once we had the initial idea, we basically took it up and ran with
it. And the result is a very successful feature, and this chapter, too.
Leaving aside the anthropomorphism of the query optimizer, what is going on
is that the query optimizer reads the query and tries to match a relevant index
for it. If it can't find a relevant index, it will create the index that can answer
this query. It will then start executing the query against the index. Because
indexing can take some time, it will wait until the query has enough results to
fill a single page, or 15 seconds have passed (or it completed indexing, of course)
before it will return the results to the user³.

7.1.1 The life cycle of automatic indexes


Indexes that are created by the query optimizer are named with the Auto/
prefix. This indicates that they were created by the query optimizer and that
they are being managed by the query optimizer as well.
¹ In general, I found that Rob was very gifted in the art of very quickly breaking any of my code that got near his orbit.
² RavenDB is Open Source Software exactly because of moments like that, when someone can come and turn the whole thing around with a new idea.
³ To do otherwise would ensure that the very first dynamic query of a certain type would always result in no results being returned, which would be confusing.



An index in RavenDB is very efficient, but it isn't actually free. When we allow
the query optimizer to just create an index on the fly, we also run the risk that
so many indexes will be created that it becomes a performance hog. In order
to handle that, the query optimizer doesn't just create an index and set it free;
it actively manages it.
A dynamic index (any index with the Auto/ prefix, which only the query optimizer should generate) is tracked for usage. If a dynamic index isn't
being used, it will be continuously downgraded and eventually removed. The actual process is a bit involved and contains some heuristics to avoid index churn
(creating and deleting auto indexes).
An auto index is created, and from that point, it is business as usual as far as
that index is concerned. It gets filled in pretty much the same way as any other
index. However, if the index isn't queried, we have to make a decision about
what to do with it. That depends on the index's age. If the index is
older than an hour, it means that it had enough queries to hold it in the air for
a long period of time, which in turn means that it is very likely to be
used again.
In that case, the index goes down the reduced resource usage path, which will first
mark it as idle, and eventually abandoned. But if this is a new index, it means
that we probably just tried some query out, or had a one off administrative
query. In that case, we'll retire the index into idle mode, and after that we'll
delete it.
Over time, this results in a system that has only the indexes that it needs,
and usually most of those indexes were created for us by the query optimizer.
Another nice feature is that changes in behavior (for example, because of a new
release) will result in the database optimizing itself for that behavior.
But the query optimizer still has some tricks to show us. Let us talk about
index merging.

7.1.2 Index merging


A common question that is asked in the mailing list is: "Is it better to
have a few fat indexes or more numerous lean indexes?"
The answer to that question is quite emphatically that we want the fewer and
bigger indexes. A lot of the cost around indexing is in the actual indexing
itself; the cost per indexed field is so minor it doesn't usually matter very much
at all. That means that we have somewhat of a problem here: we create indexes
on the fly, and most of the time they are created as indexes to answer a very
particular query. Wouldn't that cause a lot of small indexes to be created?
Listing 7.3 shows several such queries run over products, each with a different
set of fields it needs.


Listing 7.3: Running multiple dynamic queries on Products

RavenQueryStatistics stats;
var q = from product in session.Query<Product>()
            .Statistics(out stats)
        where product.Discontinued == false
        select product;
q.ToList();
Console.WriteLine(stats.IndexName);

q = from product in session.Query<Product>()
        .Statistics(out stats)
    where product.Category == "categories/1"
    select product;
q.ToList();
Console.WriteLine(stats.IndexName);

q = from product in session.Query<Product>()
        .Statistics(out stats)
    where product.Supplier == "suppliers/2"
        && product.Discontinued == false
    select product;
q.ToList();
Console.WriteLine(stats.IndexName);

The output of the code in Listing 7.3 is very interesting:

Auto/Products/ByDiscontinued
Auto/Products/ByCategoryAndDiscontinued
Auto/Products/ByCategoryAndDiscontinuedAndSupplier
Just by the index names, you can probably guess what is going on. In order to
reduce the number of indexes involved, when we create a new dynamic index
and other dynamic indexes for that collection already exist, we will merge them
all.
What about the old dynamic indexes?
What does the query optimizer do with the Auto/Products/ByDiscontinued index once it creates the Auto/Products/ByCategoryAndDiscontinued index? And what does it do with Auto/Products/ByCategoryAndDiscontinued once it creates the Auto/Products/ByCategoryAndDiscontinuedAndSupplier index?
The surprising answer is that it doesn't do anything. It doesn't need
to. Those indexes are there, and will be cleaned out by the usual
query optimizer process once they stop being used. Eventually, they
will be removed, but there is no rush to do this right now, and rushing
might actually hurt things.
The end result of the code in Listing 7.3 would be a single fat index that can
answer all our queries about the documents in a particular collection. That
is the most optimal result in terms of indexing decisions. But that raises an
interesting question: what would happen if we ran the same code again, against
the current database with the new automatic indexes?
If you do that, you'll see that the only index being used is Auto/Products/ByCategoryAndDiscontinuedAndSupplier.

7.1.3 Dynamic index selection


When the query optimizer has more than a single choice of index, it needs
to make a selection between those choices. And the choice it makes is usually
based on a very simple metric: the width of the index. The wider the index,
the more work it does, and the easier it is to send it more queries, so we'll favor it.
Behind this decision is the knowledge that automatic indexes that don't
get enough queries will be removed, so just the fact that we aren't directing any
queries to an index will end up removing it for us.
However, there is one scenario in which RavenDB will select an index to execute a query even if it isn't the widest index that can serve it. Consider the
following case: you have a database with a lot of documents and an existing
auto index (such as Auto/Products/ByCategoryAndDiscontinued), and all of a
sudden a new query comes by that requires you to create a new index (such as
Auto/Products/ByCategoryAndDiscontinuedAndSupplier).
Queries that used to be served by Auto/Products/ByCategoryAndDiscontinued can now be served by Auto/Products/ByCategoryAndDiscontinuedAndSupplier, but there is a problem. Auto/Products/ByCategoryAndDiscontinuedAndSupplier is a new index, and as such, didn't have the chance to go
through all the data. If we direct queries to it that can be answered by Auto/Products/ByCategoryAndDiscontinued, we might miss out on information
that Auto/Products/ByCategoryAndDiscontinued has already indexed.
Because of that, we also consider how up to date an index is, and we'll prefer
the freshest index first, then the widest.
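In pseudo-C# (all names invented, just to make the rule concrete), the selection boils down to something like:

// Illustrative sketch: among the candidate indexes that can answer
// the query, prefer a non-stale one, and among those, the widest.
var chosen = candidateIndexes
    .OrderByDescending(index => index.IsUpToDate) // freshest first
    .ThenByDescending(index => index.FieldCount)  // then the widest
    .First();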
Querying the query optimizer


You can ask the query optimizer what it was thinking when it chose a particular index to run a particular query. You can do that
using the following REST call: GET /databases/Northwind/indexes/dynamic/Products?query=Category:categories/2&explain=true
- /databases/Northwind - the database we use
- /indexes/dynamic/Products - dynamic query on the Products collection
- ?query=Category:categories/2 - querying all products in a particular category
- &explain=true - explain what you were thinking.
This can be helpful if you want to understand a particular decision,
although most of the time, those are self-evident.
When the Auto/Products/ByCategoryAndDiscontinuedAndSupplier index is
up to date, we will start using only that index, and no queries will go to
Auto/Products/ByCategoryAndDiscontinued. At that point, the self-cleaning
features of the query optimizer will come into play, and because that index isn't
in use any longer, it will be demoted to idle and then deleted or abandoned,
depending on its age.
The query optimizer does quite a bit of work, but we have only seen part of it.
We looked at how it manages the indexes; now let us look at the kind of queries
it has to handle.
7.2 Complex dynamic queries


Our queries so far were pretty trivial. We queried on a root property of a
document, and that was pretty much it. That is fine as far as it goes, but we are
going to need more than that in real world applications. We are going to start
looking into more complex queries, and how they are actually implemented as
indexes on the server. Then we'll talk about the actual querying options that
are available to us using dynamic queries.

7.2.1 Querying collections


How about a query on all orders that have a specific product? Here is the code:

var q = from order in session.Query<Order>()
        where order.Lines.Any(x => x.Product == "products/1")
        select order;
There are a few things going on here. We are able to query deep into the object
structure, using the Any method. But note that we are actually pulling
the full document back. Listing 7.4 shows the index that was auto generated to
answer such a query:
Listing 7.4: Auto generated index for searching for a purchased product on orders

from doc in docs.Orders
select new {
    Lines_Product = (
        from docLinesItem in ((IEnumerable<dynamic>)doc.Lines)
            .DefaultIfEmpty()
        select docLinesItem.Product).ToArray()
}
This index projects all the values of the Lines' Product properties into a single
Lines_Product property, which is what we are actually querying. Note that
this index will output a single index entry per document. That is generally a
better idea, and it is how dynamic queries on collections work in RavenDB 3.0⁴.
However, extending this line of thinking forward, what happens when we want
to create an even more complex query? A query for orders by a specific company
for a specific product with a certain quantity, for example? The code for this
query is in Listing 7.5.
Listing 7.5: Complex dynamic query and the resulting index

var q = from order in session.Query<Order>()
        where order.Company == "companies/1" &&
            order.Lines.Any(
                x => x.Product == "products/1" &&
                     x.Quantity == 4
            )
        select order;

// generated index
from doc in docs.Orders
select new {
    Lines_Quantity = (
        from docLinesItem in ((IEnumerable<dynamic>)doc.Lines)
            .DefaultIfEmpty()
        select docLinesItem.Quantity).ToArray(),
    Lines_Product = (
        from docLinesItem in ((IEnumerable<dynamic>)doc.Lines)
            .DefaultIfEmpty()
        select docLinesItem.Product).ToArray(),
    Company = doc.Company
}

⁴ In 2.5, we would have a separate index entry for each line, which caused fan-out problems on large orders.
This index, too, generates a single index entry. But note that the
Lines_Quantity and Lines_Product values are calculated separately. That means
that this query can return the wrong result if we have an order for this specific
company with the right product, but where the right quantity is on another order
line.
This issue happens because we collapse the index output into a single index
entry, and thus we get a false positive. Handling this properly is a topic for
the next chapter. For now, just be aware that using dynamic queries, RavenDB
will effectively rewrite a query such as the one in Listing 7.5 to the query in
Listing 7.6.
Listing 7.6: How RavenDB translates multiple collection clauses queries

from order in session.Query<Order>()
where order.Company == "companies/1" &&
    (
        order.Lines.Any(x => x.Product == "products/1") &&
        order.Lines.Any(x => x.Quantity == 4)
    )
select order;
This was a choice made to avoid potentially devastating fan-outs (an index that
generates multiple index entries per document indexed, sometimes a lot of
index entries per document). Using a static index (covered in the next chapter)
deals with this issue.

7.2.2 Querying capabilities


Most of the time, when you query using RavenDB, you'll be using the LINQ
API. We just encountered one limitation with querying into collections using
that API. So what can and cannot be done with RavenDB queries?
Dynamic queries are inherently limited
You can do quite a lot with dynamic queries, but in the end, they are
limited to very few operations. Comparing a value to a property
on the object (or a nested object or a collection item) is pretty much
it.
For more complex queries, such as full text search, spatial queries or
using computed results, we'll use static indexes.
We have already seen that we can compare to a value, so equality is obviously
something that we can do. We can also query by range (greater or smaller than
a value, or between two values). Listing 7.7 shows a few of the query options
that we have:
Listing 7.7: Various querying options in RavenDB dynamic queries

// equality on a document property
from order in session.Query<Order>()
where order.Company == "companies/1"
select order;

// less than on a document property
from order in session.Query<Order>()
where order.OrderedAt < new DateTime(2014, 1, 1)
select order;

// greater than or equals on a nested collection object property
from order in session.Query<Order>()
where order.Lines.Any(l => l.Discount >= 5)
select order;

// in on a document property
from order in session.Query<Order>()
where order.Employee.In("employees/1", "employees/2", "employees/3")
select order;

// starts with on a nested object property
from order in session.Query<Order>()
where order.ShipTo.Country.StartsWith("Is")
select order;

// complex conditionals
from order in session.Query<Order>()
where (
    order.OrderedAt < new DateTime(2014, 1, 1)
    ||
    (
        order.ShipTo.Country.StartsWith("Is") &&
        order.Lines.Any(l => l.Discount >= 5)
    )
)
select order;
As you can see in Listing 7.7, you can do quite a lot with dynamic queries. Most
simple queries in RavenDB are using dynamic queries. But there is also a lot
that you cannot do.
Consider the following query:
from order in session.Query<Order>()
from line in order.Lines
where line.Discount >= 5
select order;

vs.

from order in session.Query<Order>()
where order.Lines.Any(l => l.Discount >= 5)
select order;
The first query and the second are conceptually the same, but the second one
uses an Any method instead of the multiple from clauses (or SelectMany, if you
are using the method syntax). Conceptually the same, in the sense that this
will result in the same filtering going on. But the outputs of those queries are
very different.
In the first query, you'll get the order back as many times as there are matching lines in
the order, while in the second query you'll get the order exactly once. That, and
the exploding complexity of trying to parse arbitrary LINQ queries, has caused
us to limit ourselves to the simpler syntax.
Group by clauses, let clauses, multiple from clauses, and join clauses in LINQ are also not supported for queries. They don't make a lot of sense for a document database, and while we have better alternatives for those, the syntax exposed by LINQ doesn't make it possible to expose them easily.
Ordering, however, is fully supported, as you can see in the following query:
from order in session.Query<Order>()
where order.Lines.Any(l => l.Discount >= 5)
orderby order.ShipTo.City descending
select order;
It is important to remember that in RavenDB, querying is done on each document individually; it isn't possible to query a document based on another document's properties. Well, to be rather more exact, that is certainly possible, but it isn't done via dynamic queries. We'll touch on that as well in the next chapter.

7.2.3 Includes in queries


What you can do with dynamic queries and associated documents is to Include them. We first ran into includes in Chapter 3, but they are mostly useful during queries. Let us say that we want to show the ten most recent orders, along with the name of the company that made each order, as shown in Listing 7.8.
Listing 7.8: Printing the top 10 recent orders and their company name, inefficiently
var q = from order in session.Query<Order>()
        orderby order.OrderedAt descending
        select order;

foreach (var order in q.Take(10))
{
    var company = session.Load<Company>(order.Company);
    Console.WriteLine(company.Name + ", " + order.OrderedAt +
        ", " + order.ShipTo.Country);
}
The problem with the code in Listing 7.8 is that it generates a lot of remote calls to the database, 11 of them, to be exact. We can reduce all of that cost to a single remote call by utilizing includes, as seen in Listing 7.9.
Listing 7.9: Efficiently getting the latest orders and their company
var q = from order in session.Query<Order>()
                             .Include(o => o.Company)
        orderby order.OrderedAt descending
        select order;
Now we only go to the database once, and the rest of the code remains just the
same. You can also specify multiple include clauses, for example, if you wanted
to load the company and the employee documents along with the query results.
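A minimal sketch of that, chaining a second Include call onto the query from Listing 7.9 (the Employee reference is loaded the same way the Company is):

var q = from order in session.Query<Order>()
                             .Include(o => o.Company)
                             .Include(o => o.Employee)
        orderby order.OrderedAt descending
        select order;

foreach (var order in q.Take(10))
{
    // both loads are served from the session cache, with no extra remote calls
    var company = session.Load<Company>(order.Company);
    var employee = session.Load<Employee>(order.Employee);
    Console.WriteLine(company.Name + ", " + employee.LastName);
}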
This takes care of the 2nd Fallacy of Distributed Computing (latency is zero) by reducing the number of remote calls (and the price we have to pay for each of those). It doesn't take care of the 3rd and 7th Fallacies (bandwidth is infinite, and transport cost is zero). For those, we have projections and transformers.

7.2.4 Projections
Consider the following query. What would be its output?
from order in session.Query<Order>()
where order.Employee == "employees/1"
select order;
Well, it is pretty obvious that we are going to get the full order document back. And that is great, if we wanted the full document. But a lot of the time, we just want a few properties back from the query results, just enough to show them on a grid. How would that work? About as simply as you can think:

from order in session.Query<Order>()
where order.Employee == "employees/1"
select new { order.Id, order.Company, order.OrderedAt };
Using this method, we can get just the relevant properties out, and not have to pay to shuffle full documents from the server to our client5. This reduction in size can save us a lot in terms of data transfer costs alone, even leaving aside the time factor.
This is great, except that we can only pull the data from the source document. We can't pull data from an associated document, so we still have to use Include for that, which gives us the whole document. But we want just a property or two from there, nothing more.
In order to handle that, we can use transformers.

7.2.5 Result Transformers


We already looked at Result Transformers in Chapter 5; in fact, we had to deal with this exact same problem there. Result transformers are server side LINQ statements that allow us to modify the output of the server.
Because they run server side, they have a lot of freedom to do things that we can't do from the client side. And one of those things is to pick just the right data to hand back to you.
Let us look at Listing 7.10, which shows a result transformer and its usage in a query6:
Listing 7.10: Projecting data from multiple documents in a result transformer, using queries
public class OrdersHeaders :
    AbstractTransformerCreationTask<Order>
{
    public class Result
    {
        public string Id { get; set; }
        public string CompanyName { get; set; }
        public DateTime OrderedAt { get; set; }
    }

    public OrdersHeaders()
    {
        TransformResults = orders =>
            from order in orders
            let company = LoadDocument<Company>(order.Company)
            select new
            {
                order.Id,
                CompanyName = company.Name,
                order.OrderedAt
            };
    }
}

var q = from order in session.Query<Order>()
        orderby order.OrderedAt descending
        select order;

var orders = q.TransformWith<OrdersHeaders, OrdersHeaders.Result>()
    .Take(10);

foreach (var order in orders)
{
    Console.WriteLine("{0}\t{1}\t{2}",
        order.Id, order.CompanyName, order.OrderedAt);
}

5 Remember the Fallacies of Distributed Computing! Or you'll regret that in the future.
6 When trying to use transformers, make sure to have using Raven.Client.Linq; at the top of the file, to expose the TransformWith extension method.

Using this method, we are able to pick just the data that we want, and only send the relevant details to the client. There are a few important aspects to note here. The TransformWith method call takes a query, and operates over the results of that query. Note that even actions that appear to happen later (like the Take(10) call) will be applied before the transformer is run.
In other words, on the server side, we'll only get 10 orders to work through inside the transformer. That means that we'll only need to run the transformation (and load the associated company) 10 times.
Another result of this decision is that by the time the transformer is called, all the paging and the sorting has already been done for it, so while it can apply its own filtering and sorting, the data it receives has already been processed.
It is very common to use transformers for getting the data to show whenever you have a grid or a table of some sort. You can pick just the data you want (or even do this dynamically), and you can even merge data from multiple documents, if that makes sense.
All of that assumes that you have relatively large documents, and that you only want to show a few of their properties. If you actually need the whole thing, just load the entire document.
7.3 DocumentQuery vs. LINQ queries


The language integrated query is one of my favorite features in .NET. I distinctly remember watching the PDC session in 2005 and feeling insulted. .NET 2.0 wasn't officially out yet, but it already felt clunky compared to the LINQ statements shown at the PDC.
Then .NET 3.5 came out, and we actually got to see the kind of work that was involved in writing LINQ. From the customer's side, it was pure joy, most of the time. From the provider's point of view, however, things were much different.
I had the chance to take part in several different LINQ providers, including NHibernate's attempt to provide full & complete support for everything you can do in LINQ. That was hard.
With RavenDB, we had to deal with the fact that even though LINQ isn't database dependent, a lot of the operations had a distinctly relational smell to them. The join clause pops to mind, of course. But RavenDB has pretty good LINQ support, even if that cost us a lot of effort.
The cost of supporting LINQ
Building the core RavenDB engine took us about 2 - 3 months. It was shamefully feature poor, compared to what RavenDB is today, but we are still talking about storage, documents, indexing, querying, map/reduce, distributed transactions, etc.
Building the LINQ provider for RavenDB took us 3 months.
Yes, that isn't a typo. Building the LINQ provider cost us more than building the entire database!
So you can say that I have a lot of mixed feelings about LINQ.
But as great as LINQ is, its greatest asset is also its weakness. It is compile time checked, which is great if you want to ensure that your code isn't going to query Custoemr7 and then spend a lot of time not figuring out what is wrong.
It isn't so great when you want to generate dynamic queries. In relational databases, this is the time you start breaking out the string concatenation tools from the torture chamber, or the Query Object from the Pattern basement. In RavenDB, you switch to using the DocumentQuery API8.
Layered APIs
The LINQ provider is actually built on top of the DocumentQuery API, so anything that the LINQ query can do, so can the DocumentQuery API. This pattern, of layering APIs on top of one another, with each layer providing additional services, is very common in RavenDB.
We already ran into that with the Session API on top of the DatabaseCommands API on top of REST calls.

7 Note the spelling issue.
8 This API was previously called the Lucene Query API.
Accessing the DocumentQuery API is done like this:
var orders = session.Advanced.DocumentQuery<Order>()
    .WhereEquals(x => x.Company, "companies/1")
    .AndAlso()
    .WhereIn("Employee", new[] { "employees/1", "employees/2" })
    .ToList();
As you can see, even in the DocumentQuery API, you still have the ability to use type safe methods, or to just use strings. Note that the query object is being mutated by the calls, and finally triggered via the .ToList() call.
You can also call .ToString() on the query, to see what will be sent to the server. In the case of the query above, we'll have the following query:
Company: companies/1 AND @in<Employee>:(employees/1, employees/2)
This is using RavenDB's modified Lucene syntax. In general, the DocumentQuery API provides the following query options, each of which translates to a Lucene fragment. You can see the behavior in Listing 7.11.
Listing 7.11: DocumentQuery options and their query syntax
q.WhereLessThan<DateTime>(o => o.OrderedAt, DateTime.Today);
// OrderedAt:{* TO 2014-09-07T00:00:00.0000000}

q.WhereLessThanOrEqual<DateTime>(o => o.OrderedAt, DateTime.Today);
// OrderedAt:[* TO 2014-09-07T00:00:00.0000000]

q.WhereGreaterThan<decimal>(o => o.Freight, 5M);
// Freight_Range:{Dx5 TO NULL}

q.WhereGreaterThanOrEqual<decimal>(o => o.Freight, 5M);
// Freight_Range:[Dx5 TO NULL]

q.WhereBetween<decimal>(o => o.Freight, 5M, 10M);
// Freight_Range:{Dx5 TO Dx10}

q.WhereBetweenOrEqual<decimal>(o => o.Freight, 5M, 10M);
// Freight_Range:[Dx5 TO Dx10]

q.WhereStartsWith<string>(o => o.ShipVia, "UP");
// ShipVia:UP*

q.WhereIn<string>(o => o.Employee,
    new string[] { "employees/1", "employees/2" });
// @in<Employee>:(employees/1, employees/2)
You can find the full reference for how DocumentQuery works in the online documentation. But note that it is very strongly recommended that you not try to generate the query strings yourself. Using the API is much simpler, and it takes care of quite a lot of things behind the scenes for you. For example, in Listing 7.11, you can see the comparisons to Freight. That is a decimal field, and we handle the comparison by using the _Range field9 and by specifying the value as Dx5, which indicates that this should be treated as a decimal value.
Beyond these, the DocumentQuery API also contains many more options. You can ask the query to provide detailed timing information using ShowTiming, or to explain the ranking of the documents using ExplainScores. You can gain fine grained control of the query using OpenSubClause and CloseSubClause or AndAlso and OrElse, or specify ordering using AddOrder, as sketched below.
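As a rough sketch of combining a few of these options (the exact method casing and signatures are assumptions to verify against the documentation, not taken from the book):

var orders = session.Advanced.DocumentQuery<Order>()
    .WhereEquals(x => x.Company, "companies/1")
    .AndAlso()
    .OpenSubclause()
        .WhereGreaterThan(x => x.Freight, 25M)
        .OrElse()
        .WhereLessThan(x => x.Freight, 5M)
    .CloseSubclause()
    .AddOrder("OrderedAt", descending: true)
    .ToList();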
But the most interesting features are used with static indexes: spatial queries, highlights and facets are all available through the DocumentQuery. Those are covered in the next chapters.

7.3.1 Property Paths


In dynamic queries, the property path that we use tells RavenDB how to find the data we need to look at. The simplest property paths look like the following snippet:
Company: companies/1
Employee: employees/1
Those refer directly to properties on the document. But more complex paths
are possible when we want to refer to nested objects, as shown below:
ShipTo.City: London
ShipTo.Country: Israel
As expected, accessing a nested property is done using the familiar dot notation. But what happens when we want to query inside a collection? Because that requires very different handling, very early on we chose a different syntax for this, the comma notation:
Lines,Price: 15
Lines,Product: products/84
9 We'll deal with those in a later chapter.
The comma operator is used to indicate that we are actually going into a collection. Of course, you can mix those up. If the order line contained an Address object of its own, we could do:
Lines,ShipTo.City: London
This tells RavenDB to first go into the Lines collection, fetch the nested property City from the ShipTo object, and then compare that to the value.
What actually happens is that the query optimizer runs through this syntax, generates the appropriate index if needed, transforms the query and then runs that query against the index.
Note that this syntax applies only to dynamic indexes. It doesn't affect querying of static indexes.

7.4 Summary
In this chapter, we started by taking on the role of the query optimizer and seeing all the things that it does: from managing queries, to generating indexes on the fly, to removing indexes on the fly, and in general taking care of our indexes.
The query optimizer takes care to do this in a way that results in an optimal system, by merging common indexes and removing old and unused ones. After going over the query optimizer details, we got down to business and looked at how dynamic queries actually work.
We queried simple properties, and then values stored inside collections; then we looked at the types of queries that can be made using dynamic queries. Those cover quite a lot of ground. We can do equality and range comparisons, compare a value to a list to find if it is in it, or create complex queries by combining multiple clauses.
Following queries, we moved to what we can do with the data they give us, and we looked into using Include in queries. This is where includes really shine, since you can drastically reduce the number of remote calls that you have to make.
Even that wasn't enough for us, so we looked into reducing the amount of data we send over the wire by using projections, and when that isn't enough, we have transformers in our toolbox as well.
Using transformers we can pull data from multiple documents, if needed, and get just the data that we need. This gives us a way to fine tune exactly what we get, reducing both the number of remote calls and the size on the wire.
Finally, we looked at how queries are actually implemented. We inspected DocumentQuery and saw that we can use it for fully dynamic queries, without requiring any sort of predefined type, and that this is what the RavenDB LINQ
API is actually using under the covers. We looked at the results of such queries, using the Lucene syntax, and then explored how the query optimizer uses the property path syntax to know how to get to the data.
Coming next, we'll start learning about static indexes in RavenDB, which is quite exciting, since this is where a lot of the really cool stuff is happening.

Chapter 8

Static indexes
In the previous chapter, we talked about dynamic indexes and the query optimizer. For the most part, those serve very well to free you from the need to manually deal with indexing. Pretty much all of the standard queries can be safely handed over to the query optimizer, and it will take care of them.
That leaves us to deal with all of the non standard queries: full text search, spatial queries, querying on dynamic fields, or anything with real complexity. That isn't to say that defining static indexes in RavenDB is complex, only that they are required when the query optimizer can't really guess what you wanted.
By the end of this chapter, you'll know how to create static indexes in the Studio and in your code, how to version them in your source control system, and how to make RavenDB sit down, beg and (most importantly) go and fetch the right documents for you.

8.1 Creating & managing indexes


We'll start with the simplest possible index, just finding companies by their country. Go to the Northwind database in the Studio, then to the Indexes tab, create a new index, name it Companies/ByCountry and fill the map field with the following index definition:
from company in docs.Companies
select new { Country = company.Address.Country }
You can now save the index. RavenDB will now run all the documents in the Companies collection through the index definition. For the companies/1 document, the output of this index would be a single field, Country, which has the value Germany. We have gone over exactly how this works in the previous chapters.
This kind of index isn't really useful, of course. We are usually better off letting the query optimizer handle such simple indexes, rather than writing such trivial indexes ourselves. We demonstrate with such an index because it allows us to talk about the mechanics of working with indexes without going into any extra details.
With RavenDB, you don't have to define a schema, but in many cases, your code will expect certain indexes to be available. How do you manage your indexes?
One way of doing that is to do so manually. You can export just the index definitions from the development server and import them to the production server. Another would be to copy/paste the index definition from one server to the next. Or you can just jot it down using pen & paper and remember to push those changes to production.
Of course, all those options have various issues, not the least of which is that they are manual processes, and that they aren't tied to your code and the version that it is expecting. This is a common problem in any schema management system. You have to take special steps to make sure that everything is in sync, and you need to version and control your changes explicitly.
That is the cause of quite a lot of pain. When giving talks, I used to ask people how many of them ever had a failed deployment because they forgot to run the schema changes. And I got a lot of raised hands, every single time.
With RavenDB, we didn't want to have the same issues. That is why we have the option of managing indexes in your code.

8.2 Defining indexes in your code


So far, we worked with indexes only inside the Studio. That is a great way to work with them, but it doesn't allow us to manage them properly. Most importantly, it is possible for indexes to go out of sync with our codebase. That is why RavenDB allows you to define indexes in code. Listing 8.1 shows the same index as before, but now defined as C# code inside our solution.
Listing 8.1: Defining an index in code
public class Companies_ByCountry : AbstractIndexCreationTask<Company>
{
    public Companies_ByCountry()
    {
        Map = companies =>
            from company in companies
            select new { company.Address.Country };
    }
}
There are a few things to note about Listing 8.1.

- The class name is Companies_ByCountry; by convention, we'll turn that into an index named Companies/ByCountry (since we can't use a slash in a C# class name).
- The class inherits from AbstractIndexCreationTask<Company>, which marks this class as an index that operates on the Companies collection.
- The value of the Map property is the actual index definition that will be sent to the server.
Note that this is just the index definition. Having this in your code doesn't do anything. In order to actually create the index, you need to execute it, like so:
new Companies_ByCountry().Execute(documentStore);
This will create the index on the server. But you don't want to have to remember this for each and every index that you have. A far more common way to handle this is to ask RavenDB to just handle it all for you:
var asm = typeof(Companies_ByCountry).Assembly;
IndexCreation.CreateIndexes(asm, documentStore);
Usually, this is done as part of initializing the document store, so all the indexes in the assembly will be picked up and created automatically. This frees you from having to deal with manually managing the indexes. If you create or modify an index, it will be created on the server automatically. During development, this is the preferred mode of working. You modify the index, hit F5, and the index is updated on the server side. If the server side index definition matches the index definition on the client, the operation has no effect.
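A minimal sketch of that initialization (the URL and database name are placeholders):

var documentStore = new DocumentStore
{
    Url = "http://localhost:8080",
    DefaultDatabase = "Northwind"
};
documentStore.Initialize();

// scan the assembly holding our index classes and create them all on the server
IndexCreation.CreateIndexes(
    typeof(Companies_ByCountry).Assembly, documentStore);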
The indexes are defined in your code, so they are versioned and deployed with your code. That relieves you from handling index updates manually, and it dramatically reduces the amount of effort and pain that you have to go through for deployments.
Locking indexes
Sometimes you need to make a change to your index definition on your live server. That is possible, of course, but you have to be aware that if you are using IndexCreation to automatically generate your indexes, the next time your application starts, it will reset the index definition.
That can be somewhat annoying, because changing the index definition on the live server can be a hotfix to solve a problem or introduce a new behavior, and the index reset will just make it go away, apparently randomly.
In order to handle this, RavenDB allows the option of locking an index. An index can be unlocked, locked (ignore) or locked (error). In the unlocked mode, any change to the index would be accepted, and if the new index definition is different from the one stored on the server, the index would be updated.
In the locked (ignore) mode, setting a new index definition would appear to complete successfully, but will not actually change anything on the server. And in the locked (error) mode, trying to change the index will raise an error.
Usually you'll just mark the index as locked (ignore), which will make the server ignore any changes to the index. The idea is that we don't want to break your calls to IndexCreation by throwing an error.
Note that this is not a security measure; it is a way for the operations team to make a change in the index and prevent the application from mindlessly setting it back.

Now that we know how to work with indexes in our code, we need to upgrade from writing trivial indexes to writing useful stuff.

8.3 Complex queries and doing great things with indexes
RavenDB doesn't allow any computation during the query. This is done to ensure that all queries in RavenDB can be easily translated into an operation on an index. That, in turn, means that all the queries in RavenDB have very high efficiency. Let us look at what looks like a simple example. In the Northwind database, I want to find all of the big orders, sorted by date. A big order is an order for more than $10,000.
Using SQL, here is how I can express such a query:
SELECT * FROM Orders o
WHERE (
    SELECT
        SUM((UnitPrice * Quantity) * (1 - Discount))
    FROM [Order Details] AS od
    WHERE o.OrderID = od.OrderID
) > 10000
Let us consider the amount of work that the database engine has to do to process such a query. First, it has to scan all the rows in the Orders table, and then sum up all the order details for each order. Only after it has the full information for each order can the database engine decide whether it needs to output the row or discard it. On small data sets, that works pretty well. On larger data sets... well, that isn't really going to work.
With RavenDB, we can try running the following query:

session.Query<Order>()
    .Where(x => x.Lines
        .Sum(l => l.Quantity * l.PricePerUnit * (1 - l.Discount)) > 10000)
    .ToList();
This looks pretty good, right? But trying to run it would result in an error saying that the LINQ provider can't understand how to translate it, or that it is not possible to perform computation during queries. That seems to be a pretty harsh limitation, doesn't it? How can we solve the problem of finding the expensive orders?
While we cannot execute computations during a query, we are absolutely able to run them during indexing. Let us look at Listing 8.2 to see just such an index.
Listing 8.2: Computation during indexing
public class Orders_Totals :
    AbstractIndexCreationTask<Order, Orders_Totals.Result>
{
    public class Result
    {
        public double Total;
    }

    public Orders_Totals()
    {
        Map = orders =>
            from order in orders
            select new
            {
                Total = order.Lines.Sum(l =>
                    (l.Quantity * l.PricePerUnit) * (1 - l.Discount))
            };
    }
}
The index in Listing 8.2 can then be queried using the code in Listing 8.3.
Listing 8.3: Querying on a computed index field
session.Query<Orders_Totals.Result, Orders_Totals>()
    .Where(x => x.Total > 10000)
    .OfType<Order>()
    .ToList();
What would the query in Listing 8.3 do? Instead of having to force the database engine to run through all of the orders, the index definition already computed the total during indexing. That means that we only have to do a seek through the index to find the relevant orders, which is very fast.

By moving all computation from query time (expensive, often requiring a run through the entire data set on every query) to indexing time (cheap, happening only when a document changes), we are able to change the costs associated with queries. Even complex queries always end up being some sort of a search on the index. That is one of the reasons why we usually don't have to deal with slow queries in RavenDB. There isn't anything that would cause them to be slow.

8.3.1 The many models of indexing


In Listing 8.3, we have seen how we can query an index that uses a computation. That syntax is quite strange, because there seem to be a lot of types being bandied about. This is because there are multiple models being used here, and in this case, they are all different.
Indexing and querying in RavenDB is done using the following models:

- The documents to be indexed.
- The index entry, which is the output from the indexing function.
- The query model, which is how we can query on the index.
- The query result, which is what is actually returned from the index.

This is a bit confusing, but it can be quite powerful. Let us see why we have so many models, first.
The difference between documents and index entries is obvious. We can see the difference quite clearly in Listing 8.2. The document doesn't have a Total field; the index computes that value and outputs that field to the index. Thus, the index entry for orders/1 has a Total field with a value of 440.
Because we have this difference between the document that was indexed and the actual index entry, we also have a difference between how we query and what is actually returned. Look at Listing 8.3. We start the query using:
Query<Orders_Totals.Result, Orders_Totals>()
The first generic parameter, Orders_Totals.Result, is the query model, and the second is the index to use. The query model is usually also the query result, since most of the time they are the same thing. In this case, however, we need to query on the Total field, which does not exist on the document.
As we discussed in Chapter 6, the way queries work in RavenDB is that we first run the query against the index, which gives us the matching index entries. We then take the __document_id property from each of those matching index entries and use that to load the relevant documents from the document store.
This is exactly what is going on in Listing 8.3. We start by using the query model Orders_Totals.Result, but what we are actually getting back from the server is the list of matching orders. Because of that, we need to explicitly change the type that the Linq query will use. We do that using the OfType<Order> method.
Note that the OfType<Order> method is actually purely for the client side.
This is required solely so that the client can understand what type it is going
to get. It has no impact on the server.
To make things more interesting, the index entry and the query model aren't always one and the same. Let us look at Listing 8.4 for one such example.
Listing 8.4: Index using different index entry and query models
public class Orders_Products :
    AbstractIndexCreationTask<Order, Orders_Products.Result>
{
    public class Result
    {
        public string Product;
    }

    public Orders_Products()
    {
        Map = orders =>
            from order in orders
            select new
            {
                Product = order.Lines.Select(x => x.Product).ToArray()
            };
    }
}
As you can see, there is a (small) difference between the output of the index definition (Product is an array of strings) and the shape of the index result (Product is just a string). The reason for the difference between the models relates to how we can query them. When RavenDB encounters a collection in the index entry fields, it actually indexes that field multiple times.
If we look at the orders/1 document, the output of the index in Listing 8.4 would be:
Product: [products/11, products/42, products/72]
However, when we want to query, we don't treat this as a collection. Instead, we treat it as a single field that has multiple values. In other words, the following query will give me all the orders for products/11:
session.Query<Orders_Products.Result, Orders_Products>()
    .Where(x => x.Product == "products/11")
    .OfType<Order>()
    .ToList();

We aren't treating this as a collection, but just as a simple value. This seems silly. Why go to all this trouble and introduce a whole new model when we could have just called Contains() and called it a day?
This behavior can be very helpful when you have a more complex index, such
as Listing 8.5.
Listing 8.5: Index for searching employees by name or territory
p u b l i c c l a s s Employees_Search :
A b s t r a c t I n d e x C r e a t i o n T a s k <Employee , Employees_Search . R es u l t >
{
public c l a s s Result
{
p u b l i c s t r i n g Query ;
}
p u b l i c Employees_Search ( )
{
Map = employees =>
from employee i n employees
s e l e c t new
{
Query = new o b j e c t [ ]
{
employee . FirstName ,
employee . LastName ,
employee . T e r r i t o r i e s
}
};
}
}
In Listing 8.5, the output of the index for employees/1 would be:
Query: [Davolio, Nancy, [06897, 19713]]
Note that we don't have a simple array, but an array that contains strings and an array of strings. We can't just call Contains() on the Query field. But because RavenDB will flatten out collections, we can query this index using the following code, and we'll get the employees/1 document:
session.Query<Employees_Search.Result, Employees_Search>()
    .Where(x => x.Query == "06897")
    .OfType<Employee>()
    .ToList();
This ability can be extremely useful whenever we want to use an index for searching. We'll cover this in a lot more depth in the next chapter.

Having so many models can be confusing, but usually all of them are the same model, so you can pretty much ignore this behavior. It is only when we start doing more interesting things that these distinctions start to matter.
We'll see a lot more interesting ways to play with the shape of the data that goes into the index entries later on in this chapter. For now, I want to talk about the importance of chastity. Um... nope, not that. What was that term... oh, yes! Purity!

8.3.2 The purity of indexes


You can do quite a lot in an index; computing the total for an order is a very simple example. However, RavenDB requires that the indexing function be pure. What does this mean?
A pure function always evaluates to the same result value given the same argument value(s), and has no observable side effects. The indexing function in Listing 8.2 is pure. Given the same input, it will always generate the same output, and it doesn't modify anything else.
A good example of a function that isn't pure is one that uses DateTime.Now. Because running it multiple times will generate a different value, depending on what the current time is, the function isn't pure. In theory, there isn't really a big issue with such an index. We can certainly make it work. The problem is what it means. A lot of the time, when users have used DateTime.Now in an index, it was related to age: "I want an index for the users who logged in during the last hour." And the index was:
from user in docs.Users
where (DateTime.Now - user.LastLogin).TotalHours < 1
select new { user.Id, user.Name }
The problem with such an index is that it looks okay, but what it actually does is different from what the user expected. It only counted the users who logged in within an hour of being indexed. After the documents had been indexed, there was no need to reindex them, so they remained in the index, and caused quite a bit of confusion.
Because of that, using DateTime.Now or any equivalent function is blocked in
RavenDB. A much better index would be:
from user in docs.Users
select new { user.Id, user.Name, user.LastLogin }
We can then query this index for all the users whose last login is later than an hour ago. This also gives us the flexibility to check for users that logged in a day ago, or just fifteen minutes ago, etc.
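A minimal sketch of the client side of that query (the User class and its property names are assumptions for illustration):

// assumes LastLogin is stored as UTC
var cutoff = DateTime.UtcNow.AddHours(-1);
var recentlyLoggedIn = session.Query<User>()
    .Where(u => u.LastLogin >= cutoff)
    .ToList();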

So far, we have looked at simple indexes: an index that has only a single map function in its definition, and operates on a single collection. But RavenDB actually allows us to do much more, using multi map indexes.

8.4 Multi map indexes


Usually an index has a single map function, and it is going to operate on a single collection. And for the most part, that is a great thing to have. But there is more that we can do with RavenDB. Multi map indexes allow us to define more than a single map function (as you probably expected from the name).
But why would we ever want to do something like that? Using the Northwind database, we need to search for a particular person. In this database, a person can be an employee, a company's contact or a supplier's contact. I don't care about that; all I care about is finding that guy, Michael. How can we do that in RavenDB? Let us take a look at Listing 8.6.
Listing 8.6: Multi map index for searching on employees or companies' and suppliers' contacts
public class People_Search :
    AbstractMultiMapIndexCreationTask
{
    public People_Search()
    {
        AddMap<Company>(companies =>
            from company in companies
            select new
            {
                company.Contact.Name,
            });

        AddMap<Employee>(employees =>
            from employee in employees
            select new
            {
                Name = employee.FirstName + " " + employee.LastName
            });

        AddMap<Supplier>(suppliers =>
            from supplier in suppliers
            select new
            {
                supplier.Contact.Name
            });

        Index("Name", FieldIndexing.Analyzed);
    }
}
What do we have in Listing 8.6? We have an index definition on the client side, but unlike the previous indexes, this one inherits from AbstractMultiMapIndexCreationTask, instead of AbstractIndexCreationTask. There is really no difference between the two, except that AbstractMultiMapIndexCreationTask allows you to specify multiple maps using the AddMap method, vs. the single Map property in AbstractIndexCreationTask1.
Unlike AbstractIndexCreationTask, we don't specify the source for the map in the class generic parameter. Instead, we use the generic parameter for AddMap<T>. For now, you can ignore the Index() call; we'll discuss it at length in the next chapter.
But that is the structure of the index in Listing 8.6; what is it actually doing? This index is going to operate on the Companies, Employees and Suppliers collections, and it is going to index a single field, Name, in all of them. Note that in the case of the employees, we are indexing a computed field, and in the other two, we are indexing a nested one.
What is important is that in all three cases, we are actually outputting the same shape to the index. Let us see what we can do with this index, shall we? You can create it using the following code:
new People_Search().Execute(documentStore);
Then go to the Studio, to the Indexes tab, and click on the People/Search index2. You are now in the query page for the People/Search index. Enter Name: Michael in the query text and then press the query button. The result is shown in Figure 8.1.
Look at the results. We have three documents here: companies/68, employees/6 and suppliers/17, from three different collections. All of them match our query. So now that we saw what was going on, it is time to understand what is actually happening.
A multi map index is just like any other index; the only difference is that it covers multiple collections. It goes through the same exact stages. It accepts documents to be indexed, but instead of having just a single map function, it has several. Usually, one for each collection that you want to include.
The nearest parallel you are probably familiar with is the notion of a union in a relational database. But unlike such a union, we aren't doing this during query time.
1 You can use AbstractMultiMapIndexCreationTask to create indexes with a single map.
2 Remember, People_Search on the client side is translated to People/Search on the server side for index and transformer names.
Figure 8.1: Results of querying the People/Search index for Michael


Instead, we merge the indexing results from multiple collections during indexing, into the same index. That allows you to do some pretty cool stuff (which we'll explore more thoroughly after we learn about map/reduce), but even at this point, I'm sure that you can see how useful such an ability can be.
If you'll take another look at Figure 8.1, you'll see that we get three documents from three different collections. How are we going to work with that on the client side?

8.4.1 Multi map indexes from the client perspective


RavenDB itself doesn't care about schema or types or any such thing, but because we are working with a strongly typed language, we need to be able to get the result of queries on the People/Search index and do something with it. We can just issue a weakly typed query, such as:

List<object> results = session.Advanced.DocumentQuery<object, People_Search>()
    .WhereEquals("Name", "Michael")
    .ToList();
We can then run through the results and cast them to the right types. That works, but it is ugly and quite uncomfortable. We can do better, by asking RavenDB to give us a common shape back. That can work if the types share a common ancestor, but a lot of the time, that isn't the case. We want to just get data from several unrelated documents and do something with it. Typically, show it to the user.
The easiest way to do that is to use projections.

8.5 Projections
Projections are a way to collect several fields from a document, instead of working with the whole document. In the case we have now, we already know the shape of the data that we want to deal with. It is the same common shape already defined in the index. The multi map index enforces that all map functions have the same output, and we can use that when the time comes to query the index. In the case of the People/Search index, that means that we are going to be showing a list of names, the type of each document and its id. Listing 8.7 shows the changes required to make this work.
Listing 8.7: Multi map index that allows projections
public class People_Search :
    AbstractMultiMapIndexCreationTask<People_Search.Result>
{
    public class Result
    {
        public string Type { get; set; }
        public string Id { get; set; }
        public string Name { get; set; }
    }

    public People_Search()
    {
        AddMap<Company>(companies =>
            from company in companies
            select new
            {
                company.Id,
                company.Contact.Name,
                Type = "Company"
            });

        AddMap<Employee>(employees =>
            from employee in employees
            select new
            {
                employee.Id,
                Name = employee.FirstName + " " + employee.LastName,
                Type = "Employee"
            });

        AddMap<Supplier>(suppliers =>
            from supplier in suppliers
            select new
            {
                supplier.Id,
                supplier.Contact.Name,
                Type = "Supplier"
            });

        Index("Name", FieldIndexing.Analyzed);
        Store(x => x.Id, FieldStorage.Yes);
        Store(x => x.Name, FieldStorage.Yes);
        Store(x => x.Type, FieldStorage.Yes);
    }
}
The first thing to note is that we now define an inner class called Result, and reference that as the generic argument. The generic argument doesn't have to be an inner class, but that is a common convention, because it ties the index and its expected result together. We have seen this used before for queries, but now we are going to use it for getting the relevant data:

var results = session.Query<People_Search.Result, People_Search>()
    .Where(x => x.Name == "Michael")
    .ProjectFromIndexFieldsInto<People_Search.Result>()
    .ToList();
What is going on in this piece of code? First, we tell the RavenDB Linq provider that we'll be querying the People/Search index, and that we'll be using the People_Search.Result class as the query model. Then we actually specify the query, and finally, we ask RavenDB to project the result from the index. What does this mean?
Look at the last lines of Listing 8.7; we have a few Store() calls there. Usually, the index doesn't bother to keep around any data beyond what it needs to actually answer a query. But because we told it to store the information, it won't only index the data, but also allow us to retrieve the information directly from the index.
Usually queries in RavenDB have the following workflow:
- Execute the query against the index
- Gather all the matching document ids
- Load the documents from the document store
- Return the documents to the client

When we use projections, the workflow is different: instead of getting just the document ids from the index, we'll also get the projected fields. That is why we have to store them in the index. That means that we'll query the index, load the results directly from it, and immediately return them to the client.

And in this case, we'll return a list of (Id, Name, Type) to the client, and the client can show it to the user, who will perform any additional actions on them.
What happens if we project a non stored field?
If you project on a field that isn't stored, RavenDB will load the document from the document store, and then get the field from the document directly. You get the correct result, but you incur the cost of actually loading the document from disk, although it won't be sent over the network. For small documents, that is hardly anything major, but for large documents, that might be something that you want to pay attention to.
Projections aren't limited to multi map indexes; you can use them in any index, and they are frequently quite useful. They also go hand in hand with transformers, which also allow you to limit the amount of data that you get from the server. A transformer running on an index will first try to find a stored field, and only if it can't find the stored field will it load the document.
Because projections are usually based on the data in the index, they are subject to the same staleness considerations. If you load the data directly from the document (whether by loading the document itself or by projecting from the document), you are ensured that you'll always have the latest version at the time of the query. If you are projecting data stored in the index, it is possible that the data on the document has changed since then.
You don't have to store all the fields in the index; you usually store just the fields that you care to project out. Linq queries such as the following one are also using projections under the covers:
from order in session.Query<Order>()
select new
{
    order.Company,
    order.OrderedAt,
}
This query will project the Company and OrderedAt fields; such queries are the reason why we fall back to the document if the index doesn't already store those fields.

8.6 Load Document


Uncle Ben said that with great power comes great responsibility. My version of that is that great features can also produce the biggest headaches. The Load Document feature is a really awesome one, but it is also something that should be used carefully. It isn't meant to be the hammer that you'll use to beat every problem into submission. Scared yet?
I do admit that this is quite an introduction for a feature that we haven't even discussed. But we have had a lot of problems with users tripping themselves up over this feature, and mostly this has been because they came to it from a relational mindset. So I would ask that you read this section to understand the technical nature of this feature, but refrain from using it until you have read Part III - Modeling and understand the proper way to design a document based model with RavenDB.
With this ominous introduction out of the way, let us see what this feature is all about. And it is a really cool feature. Load Document allows you to load another document during indexing, and use its data in your index. Let us take a simple example: we want to search for a product by its category. You can see how products store the category information in Figure 8.2.

Figure 8.2: The products in the Northwind database


This means that it is trivially easy to run a query asking: "Give me all the products in the categories/1 category." However, what would happen if we wanted to ask: "Give me all the products in the Beverages category"? Well, that is a much harder query to answer. In fact, using what we have done so far, we can't answer it.
What we would have to do is first make a query to find the category ids of all the categories whose name is Beverages, then query the products in those categories. That works, and it is in fact usually recommended, but Load Document gives us another option, as Listing 8.8 shows.

Listing 8.8: Using LoadDocument in an index
public class Products_Search :
    AbstractIndexCreationTask<Product, Products_Search.Result>
{
    public class Result
    {
        public string CategoryName { get; set; }
    }

    public Products_Search()
    {
        Map = products =>
            from product in products
            select new
            {
                CategoryName = LoadDocument<Category>(product.Category).Name
            };
    }
}
What is going on in Listing 8.8? We are calling LoadDocument<Category>(product.Category) in the map, loading the relevant category and indexing its name. That means that we can now query for the products in the Beverages category using the following code:

var results = session.Query<Products_Search.Result, Products_Search>()
    .Where(p => p.CategoryName == "Beverages")
    .OfType<Product>()
    .ToList();
As we discussed previously in this chapter, because we have a different model for the shape of the index and the shape of the result, we start the query with Products_Search.Result and then use OfType<Product> to change the Linq result to the appropriate returned type.
Now, what is actually going on here? Let us consider the case of the products/70 document, being indexed by the Products/Search index. During indexing, we call the LoadDocument method, which fetches the categories/1 document and indexes the category name in the index entry for the products/70 document. Now we can search for CategoryName: Beverages, and get (among others) the products/70 document.
So far, so good, and a really useful feature. But let us get down to the nitty gritty details. Because this happens during indexing, what happens if the relevant category is null, or if the value in that field is a non existent document id? Like any other null handling in RavenDB, this is handled for you, and you don't need to write null checks or guard against it.
A more interesting problem happens when we deal with changes. Take the previously mentioned products/70 and its associated categories/1. The index entry for products/70 looked like the following:
{ CategoryName: "Beverages", __document_id: "products/70" }

What happens when we are update the categories/1 document? For example,
to change the name from Beverages to Drinks? Documents are only re-indexed
when they are changed. So we would expect that since products/70 didnt
change, we wont have this index entry update, and it will reect the state of
the data at the time of the products indexing.

RavenDB is smart enough to recognize that this would turn the feature into a curiosity, nothing more. Because of that, RavenDB creates an internal association between the two documents. You can think about it as an internal table that associates products/70 with categories/1. Whenever we put a document, we check those associations, and we touch all the referencing documents, which will force their re-indexing.
Who touched my documents?
Touching a document usually happens whenever we detect that a document that has been referenced by this document during indexing has changed, so the document needs to be reindexed (to pick up the new values from the referenced document). Touching a document involves updating its etag value, which will force it to be re-indexed.

8.7 The dark side of Load Document


So far, I gave a grave warning, and then proceeded to show how cool it is. Why the pre-emptive strike against this feature?
The problem with Load Document is that it allows users to keep a relational model when they work with RavenDB, and use Load Document to get away with it when they need to do something that is hard to do with RavenDB natively. That wouldn't be so bad, if Load Document didn't have several important costs associated with it.

8.7.1 The commonly referenced & updated document


In one memorable case, a customer was complaining about very high CPU usage and constant work. Their usage didn't match their load. After investigating, we realized that they had a very large collection (98%+ of their documents) which had an index similar to this:

from order in docs.Orders
select new
{
    // lots of other stuff
    Status = LoadDocument("config/global").Status[order.StatusId]
}
That means that for all the order documents in the database, we had an association to the config/global document. That is bad enough, but this config/global document was also updated every 15 minutes by a background task. I'm sure that you can figure out where that led.

Every 15 minutes, the config/global document would be updated, forcing us to touch all of the order documents which referenced it. That was expensive enough, but it gets worse. Because we touched all those documents, we now needed to index them. Effectively, they managed to create a denial of service attack against RavenDB. Things actually worked for them, to the point where this was a "concerning" issue rather than a "sky is falling" issue, because RavenDB was able to adapt to the load by shifting resources around and still serve requests and process indexes.
That is a pathological case, admittedly, but it does show a problem. Basically, the cost of updating a document is directly proportional to how many documents are referencing it. And we had cases where the number of documents referencing a document was big enough that updating it became practically impossible.
Recommendation: Use LoadDocument carefully
For the most part, properly modeling your domain saves you from the need to use Load Document. But when you use it, you need to consider the implications. In particular, make sure that the references you are creating aren't going to cause a single document to be referenced by a very large number of documents. Usually, that indicates a weakness in your domain model, which should be addressed.

8.7.2 Load Document costs and heuristics


The Load Document associations are kept on a per index basis, so the cost of deleting or updating an index definition is also directly proportional to the number of associations in the index. Usually, this doesn't matter, since index definition deletions or updates are async, but that is something to be aware of for the I/O costs involved.
If that isn't enough for you, Load Document completely messes up all the heuristics RavenDB uses to optimize indexing. Because the Load Document call is opaque to RavenDB, whenever you call it, we have to load the document from disk. RavenDB goes to great lengths to avoid having to do that on a synchronous code path, but Load Document doesn't give us another option; we are given a document id and we need to load it. There are things that we can do (caching of loaded documents in the same batch, for example), but for the most part, Load Document forces us to stop indexing, load the document, and resume indexing.
If you have a large number of Load Document calls, that can significantly slow down your indexing time. This usually doesn't show up during normal operation (where the number of documents indexed is relatively small), but it is an issue if you are updating an index definition and need to index a large number of documents (and call Load Document a large number of times).

The call of Load Document is a tempting one: "Come to the Dark Side, we have cookies." And it is a very powerful feature, but it is something that you should use only after carefully understanding its implications.
The most common usage of Load Document, unfortunately, is when users model documents in RavenDB in the same manner that they did when using a relational database, failing to understand that a very different database engine requires a different approach. We'll discuss this at great length in the modeling part of the book.

8.8 Queries
We have gone over a lot of the details of static indexes, and we are almost done with this topic. Before we go on to talk about full text search and map/reduce, I wanted to go over the query options available in RavenDB. This isn't meant to be an exhaustive list (see the documentation for that), but it should give you a good idea about the kind of querying capabilities that RavenDB has.
A lot of this is probably obvious, and you can skim through the rest of this section without losing much.

8.8.1 Equality and comparisons


Probably the easiest to explain is the notion of equality. RavenDB supports equality in queries using the following:

var results = from emp in session.Query<Employee>()
              where emp.FirstName == "Steven"
              select emp;
By default, RavenDB compares values as strings, so when we do something like this: Where(x => x.IsActive), we are actually sending the following query to the server: IsActive:true. You don't need to worry about the cost of string equality checks. We aren't actually comparing the values. What we do is scan through a sorted index to get the relevant details out. Unless you specified differently (discussed in the next chapter), RavenDB will use a case insensitive comparison to do this match.
This is important because we had questions in the mailing list where people
dened a comparison operator or a overrode Equals and were surprised when
RavenDB didnt pick up on that. Queries are always executed on the server,
and they dont involve any code written by the user.
Comparisons (greater than, greater than or equal, less than, less than or
equal) are also pretty obvious:

var results =
    from order in session.Query<Order>()
    where order.Freight >= 25
    select order;
Those, too, are using the index rather than comparing all the values. Note that
we have implemented special behavior for numerics, so we can compare them
without incurring "3 is greater than 10" issues from lexical comparisons of
numbers.

8.8.2 Query clauses


Queries aren't limited to single property equality or comparison. You can create
complex queries using OR and AND. Those behave as you would expect them to, as
the following query demonstrates:
var results =
    from order in session.Query<Order>()
    where (order.Freight > 25 && order.Freight < 50) || order.ShippedAt == null
    select order;
This LINQ statement is translated to the following query:
Freight_Range:{Dx25 TO Dx50} OR ShippedAt:[[NULL_VALUE]]
You can see that we query on the Freight_Range field. This is an automatically
generated field that holds the numeric value of the Freight field in a way that allows us to do efficient range searches. The order.Freight > 25 && order.Freight < 50
was translated to an efficient between operation: Freight_Range:{Dx25 TO Dx50},
and the OR between the two clauses is there as well.
The null comparison is a bit interesting; since we need to compare to something
in the index, we use the value [[NULL_VALUE]] as a sentinel null value.
You can read more about how this works in the documentation about the full
query syntax that RavenDB uses.

8.8.3 Prefix and postfix searches


RavenDB also supports prefix searches (StartsWith); this is done by the following LINQ statement:
var results =
    from emp in session.Query<Employee>()
    where emp.FirstName.StartsWith("Marg")
    select emp;

This is translated to the following query: FirstName:Marg*. The same can
be done in reverse, by using EndsWith. However, that is not advisable. Using
a prefix query, we can make good use of our indexes to efficiently answer the
query. The same is not true for EndsWith, which requires us to scan the entire
index. A better alternative, if you need to query using EndsWith, is to create
an index with the reversed value and use the much more efficient StartsWith on
that.
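
As a minimal sketch of that reversed-value technique (the index name, the Result class and the exact Reverse expression here are illustrative assumptions, not a prescribed API):

using System.Linq;
using Raven.Client.Indexes;

public class Employees_ByReversedFirstName : AbstractIndexCreationTask<Employee>
{
    public class Result
    {
        public string ReversedFirstName { get; set; }
    }

    public Employees_ByReversedFirstName()
    {
        // Index the first name reversed, so an "ends with" question
        // becomes a cheap prefix scan over the sorted index.
        Map = employees => from emp in employees
                           select new
                           {
                               ReversedFirstName =
                                   new string(emp.FirstName.Reverse().ToArray())
                           };
    }
}

// At query time, reverse the suffix we are looking for and use StartsWith:
var suffix = new string("ret".Reverse().ToArray()); // finds "Margaret"
var results = session.Query<Employees_ByReversedFirstName.Result,
                            Employees_ByReversedFirstName>()
                     .Where(x => x.ReversedFirstName.StartsWith(suffix))
                     .OfType<Employee>()
                     .ToList();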

8.8.4 Contains, In and nested queries


Queries in RavenDB offer a few additional operations that you can take advantage of, starting with Contains. Here we need to make a distinction between
Contains on a list and Contains on a string. Let us examine the following query:
var results = from order in session.Query<Order>()
              where order.ShipTo.Line1.Contains("Acorn")
              select order;
Trying to execute this query will result in the following error:
Contains is not supported, doing a substring match over a text
field is a very slow operation, and is not allowed using the
Linq API.
There are far better ways to handle such a scenario, as we'll see in detail in
the next chapter. But Contains isn't limited to just substring searches; you can
search for an item in a list. Finding an employee who is responsible for a certain
territory, for example:
var results = from employee in session.Query<Employee>()
              where employee.Territories.Contains("Asia")
              select employee;
This will generate Territories:Asia as the query and will find all the relevant
employees. We can do this because at indexing time, we can feed the Territories
array into the index and perform a highly efficient search in the index to find all
the employees that have Asia in their territories.
What about the reverse of Contains? What if we aren't interested in searching
inside a list, but searching for a match from a list? What if I want to find all
the companies in Spain, Denmark or Switzerland?
The Raven.Client.Linq namespace
Many advanced operations are exposed via extension methods from
the Raven.Client.Linq namespace, and if you are making any use of
interesting queries with RavenDB, you will want to add a using
statement for that namespace.
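
That is, at the top of any file issuing such queries you would have:

using Raven.Client.Linq; // brings In() and the other query extensions into scope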

We can issue a query for those companies using the following code:
var results =
    from c in session.Query<Company>()
    where c.Address.Country.In("Spain", "Denmark", "Switzerland")
    select c;
And the resulting query would be @in<Address.Country>:(Spain,Denmark,Switzerland).
Note that this is using a RavenDB specific extension to the Lucene query
syntax to provide an efficient search mechanism for potentially large lists.
Finally, the last advanced query operation we'll deal with in this chapter is Any,
which is used to express nested queries. The question we pose is this: find all
orders that have a line item with a specific product.
var results = from order in session.Query<Order>()
              where order.Lines.Any(x => x.ProductName == "Milk")
              select order;
And the generated query is: Lines,ProductName:Milk. Pay attention to the
comma operator in the query. This instructs RavenDB that the Lines property is
a collection, and that we are nesting into the properties of the items in that collection.
There is one very important detail that you need to be aware of: how multiple
predicates are treated in this situation. Let us consider the following LINQ query:
var results =
    from order in session.Query<Order>()
    where order.Lines.Any(x => x.ProductName == "Milk" && x.Quantity == 3)
    select order;
It asks for all orders that have a line item with the name Milk and a quantity of
3. However, the generated query would be: Lines,ProductName:Milk AND Lines,Quantity:3.
What effectively happens is that as far as RavenDB is concerned, the query we
issued is:
where order.Lines.Any(x => x.ProductName == "Milk") &&
      order.Lines.Any(x => x.Quantity == 3)
In other words, there is no guarantee that the match on the ProductName being
Milk and the Quantity equaling 3 come from the same line item. Why is that?
In order to understand the reasoning behind this, we need to look at the auto
generated index for this query:
// Auto/Orders/ByLines_ProductNameAndLines_Quantity
from doc in docs.Orders
select new {
    Lines_ProductName =
        from docLinesItem in ((IEnumerable<dynamic>)doc.Lines)
        select docLinesItem.ProductName,
    Lines_Quantity =
        from docLinesItem in ((IEnumerable<dynamic>)doc.Lines)
        select docLinesItem.Quantity
}
In other words, we are flattening all the line items into a single index entry. This
is done to avoid a fanout, a situation where a single document outputs a very
large number of index entries. A big fanout can cause RavenDB to consume a
lot of memory and I/O during indexing, so we try to avoid it.
Advanced query operations and dynamic vs. static indexes
Usually, there isn't any difference between queries made against
a dynamic index or a static index. But as you can see in the
Auto/Orders/ByLines_ProductNameAndLines_Quantity
index,
RavenDB will make specific transformations when making certain
queries (specifically, Any queries, using the comma operator). So
a query such as Lines,Quantity:3 will be translated to a query on
the Lines_Quantity field. If you want to make such queries on your
static indexes, you are probably better off matching the expected
RavenDB naming.
There are conventions that control this behavior that you can tune
to your own preferences, but it is usually much easier to just follow the
same behavior as the rest of RavenDB.
You can define your own static index to answer this query, one which will have
a separate index entry per line item, allowing you to query the quantity and
product name of a specific line item. In that case, you can be explicit about
the number of entries that you want to allow in your fanout, as the sketch below shows.
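
Here is a minimal sketch of such a fanout index (the index name is illustrative, and the output cap is an arbitrary example value, assuming the MaxIndexOutputsPerDocument option on index definitions):

using System.Linq;
using Raven.Client.Indexes;

public class Orders_ByLine : AbstractIndexCreationTask<Order>
{
    public Orders_ByLine()
    {
        // One index entry per order line, so ProductName and Quantity are
        // matched against the same line item.
        Map = orders => from order in orders
                        from line in order.Lines
                        select new
                        {
                            line.ProductName,
                            line.Quantity
                        };

        // Be explicit about how large a fanout a single document may produce.
        MaxIndexOutputsPerDocument = 32;
    }
}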

8.9 Summary
In previous chapters, we dealt a lot with the indexing mechanisms: how
RavenDB indexes documents, how dynamic queries work, etc. In this chapter, we
have explored the various options that we have when creating our indexes, and
how that affects the sort of queries that we can make.
We started by looking into how we are actually going to create, maintain and
manage indexes in RavenDB. Defining indexes in your code allows us to version
them alongside your code, in the same source control system, and using the
same tools.
Afterward, we moved to creating complex indexes and moving the cost of computations from query time to indexing time, one of the ways in which we are
able to perform super fast queries with RavenDB. We talked about the document model, the indexing model, the query model and the result model, and
how using slightly different models allows us to create some pretty impressive
results.
Multi map indexes were the next topic at hand, allowing us to index multiple
collections into a single index and query them in an easy manner. Following
multi map, we discussed projections and the general process of loading data
from an index and storing information in the index.
We discussed the LoadDocument feature, its strengths and its weaknesses. This
feature allows us to index related documents, but at a potentially heavy cost
during indexing. We concluded this chapter by going over the querying options
that we have: we looked at equality and range comparisons, complex queries,
the cost of prefix and postfix searches and advanced query operations such as
Contains, In, and performing nested queries using Any.
That is quite a lot to digest, and I recommend that you spend some time
playing around with the indexing and querying options that RavenDB offers,
to get yourself familiar with the range of tools that it opens up. In the next
chapter, we are going to talk about full text searching, making full use of Lucene
and its capabilities. After that, we'll go on to talk about map/reduce and what
we can do with that.

Chapter 9

Full text search


9.1 Inverted Index
9.2 Terms
9.3 Analyzers
9.4 Stop words
9.5 Facets
9.6 Suggestions

9.7 Complete search example


Part III
In this part, we'll learn about scale out:
Replication
Sharding
Reporting & SQL Replication


Part IV
In this part, we'll learn about operations:

Monitoring ?
Endpoints ?
Troubleshooting
Security
