
Index Structures

Memory organization
• Various kinds of storage hardware available
• Hard Disk
• DRAM
• SRAM
• Solid state drives
• Generally, faster storage is more costly
• How to achieve the optimum balance between fast storage and
cost?
Memory hierarchy

Our focus

In newer architectures, L2 may also reside on-chip


How to get the block size?
• Sector Size:
• sudo hdparm -I /dev/sda | grep -i physical
• > Logical/Physical Sector size: 512 bytes
• Block Size:
• sudo blockdev --getbsz /dev/sda1
• > 4096
• So, access time is ~12 ms to read a sector (1 KB).
• How much time would it take to read 2 KB?
• Depends: how is it stored? Sequential? Or random?
Random Vs Sequential Disk access
• Random disk access time = T_seek + T_rotation + T_transfer

• For sequential access, T_rotation and T_seek are not incurred


• Speed up: 12/0.02=600 times faster
• Take away point: Avoid random I/Os
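The arithmetic behind the 600x figure can be sketched as follows. The drive parameters (8 ms average seek, 7200 RPM, 400 sectors per track) are hypothetical values chosen only so that the result lands near the ~12 ms and ~0.02 ms quoted above; they are not taken from the slides.

```python
# Hypothetical drive parameters (assumptions, not from the slides).
AVG_SEEK_MS = 8.0          # average seek time
RPM = 7200                 # rotational speed
SECTORS_PER_TRACK = 400    # average number of sectors per track


def rotation_ms(rpm):
    """Time for one full platter rotation, in milliseconds."""
    return 60_000.0 / rpm


def transfer_ms(rpm, sectors_per_track):
    """Time for a single sector to pass under the head."""
    return rotation_ms(rpm) / sectors_per_track


def random_access_ms(seek_ms, rpm, sectors_per_track):
    """T_seek + T_rotation (half a turn on average) + T_transfer."""
    return seek_ms + rotation_ms(rpm) / 2 + transfer_ms(rpm, sectors_per_track)


t_random = random_access_ms(AVG_SEEK_MS, RPM, SECTORS_PER_TRACK)
# A sequential read of the next sector pays only the transfer time.
t_sequential = transfer_ms(RPM, SECTORS_PER_TRACK)

print(f"random access:     {t_random:.2f} ms")
print(f"sequential access: {t_sequential:.3f} ms")
print(f"speed-up:          {t_random / t_sequential:.0f}x")
```

With these assumed numbers the ratio comes out within a few percent of the 600x quoted on the slide; different drive parameters shift it, but seek and rotation always dominate.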
Disk Vs Main Memory
• Time to access a byte=60ns for DRAM
• SRAM takes 4ns to read a byte
• Generally,
• Random accesses are 10k times slower
• Sequential accesses are ~500 times slower
[Figure: $/GB (log scale, 1 to 10,000,000) over the timeline 1970 to 2020; data labels: 6328125, 859375, 103880, 30875, 1107, 189, 12, 6]
DRAM is the “new disk”
[Figure: Capacity of largest known DRAM chip, in MB (0 to 1,200,000), over time, 1975 to 2005]
Binary Search Tree (BST)

Costs?
• Insertion, deletion and searching time:
• Avg. case O(log n)
• Worst case O(n)
• Tree can be unbalanced
• Deeper analysis of costs
• Assume huge volume of data and hence BST is on disk
• Operations:
• Read each node from disk
• DISK I/O
• Decide direction
• CPU
• Bottleneck is Disk I/O
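A minimal sketch of why the unbalanced case hurts, assuming each node visited during a search costs one disk read. The names `Node`, `insert`, `build` and `nodes_read` are illustrative only.

```python
import random


class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Iterative BST insert (no rebalancing); returns the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                break
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key)
                break
            node = node.right
    return root


def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root


def nodes_read(root, key):
    """Nodes visited while searching for key; each one models a disk I/O."""
    count = 0
    node = root
    while node is not None:
        count += 1
        if key == node.key:
            break
        node = node.left if key < node.key else node.right
    return count


n = 1000
sorted_tree = build(range(n))      # sorted input degenerates into a chain
keys = list(range(n))
random.Random(0).shuffle(keys)     # fixed seed for repeatability
shuffled_tree = build(keys)

print("reads, sorted insertion order:", nodes_read(sorted_tree, n - 1))
print("reads, random insertion order:", nodes_read(shuffled_tree, n - 1))
```

Sorted insertion order produces the O(n) worst case: reaching the largest key reads every node. A random order keeps the path short, but each step is still a separate read, which is what motivates nodes with many keys per block.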
DISK I/O
B-tree
• A B-Tree of order m satisfies the following properties
• All leaf nodes appear at the same level
• Unless the root is a leaf node, it has at least two children
• All nodes other than the root have at least m keys
• All nodes including the root contain at most 2m keys
• A non-leaf node with r keys has r+1 children
• Keys in each node are sorted
• Each key k has a left subtree and right subtree. Like BST, keys in left
subtree are at most as large as k and keys in right subtree are greater
than k
B-tree
• B-tree of order 1
• Searching: Same procedure as in BST
B-Tree: Insertions
• Insert in appropriate node
• If overflow occurs, identify the median
• Split the node with median as the boundary and transfer the
median to the parent node.
• Recurse if parent also overflows
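The insertion steps above can be sketched in memory as follows, using the definition given earlier (at most 2m keys per node, so a node overflows at 2m+1 keys). This is an illustrative sketch that assumes distinct keys; `BTreeNode`, `BTree` and `inorder` are made-up names, and a disk-resident version would read and write one block per node.

```python
import bisect


class BTreeNode:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []          # sorted keys
        self.children = children or []  # empty list for a leaf

    @property
    def is_leaf(self):
        return not self.children


class BTree:
    def __init__(self, m):
        self.m = m                      # order: at most 2m keys per node
        self.root = BTreeNode()

    def search(self, key):
        node = self.root
        while True:
            i = bisect.bisect_left(node.keys, key)
            if i < len(node.keys) and node.keys[i] == key:
                return True
            if node.is_leaf:
                return False
            node = node.children[i]

    def insert(self, key):
        split = self._insert(self.root, key)
        if split is not None:
            # The root itself split: the tree grows by one level at the top.
            median, right = split
            self.root = BTreeNode([median], [self.root, right])

    def _insert(self, node, key):
        """Insert below node; return (median, new_right_node) on overflow."""
        i = bisect.bisect_left(node.keys, key)
        if node.is_leaf:
            node.keys.insert(i, key)
        else:
            split = self._insert(node.children[i], key)
            if split is not None:
                median, right = split
                node.keys.insert(i, median)
                node.children.insert(i + 1, right)
        if len(node.keys) <= 2 * self.m:
            return None
        # Overflow (2m+1 keys): split around the median and pass it upward.
        mid = self.m
        median = node.keys[mid]
        right = BTreeNode(node.keys[mid + 1:], node.children[mid + 1:])
        node.keys = node.keys[:mid]
        node.children = node.children[:mid + 1]
        return median, right


def inorder(node):
    """All keys in sorted order (helper for checking the tree)."""
    if node.is_leaf:
        return list(node.keys)
    out = []
    for i, child in enumerate(node.children):
        out += inorder(child)
        if i < len(node.keys):
            out.append(node.keys[i])
    return out


tree = BTree(m=1)
for key in [10, 20, 30, 40, 50, 25, 19, 21]:
    tree.insert(key)
print(inorder(tree.root))
print(tree.search(21), tree.search(99))
```

Each split leaves exactly m keys on either side of the median, so the minimum-occupancy property holds without extra work.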
B-Tree: Insertions
• Insert 19
B-Tree: Insertions
• Insert 21
B-tree: Insertion
• When will the height of the tree grow?
• The tree always grows at the top when the root splits
• Hence, always balanced
B-tree deletions
• Remove the key and replace it with the largest smaller key or the smallest larger key (in either case, the replacement key resides in a leaf)
• If underflow occurs in the leaf node, merge it with a sibling and also demote the common parent key into the merged node
• If the merged node overflows, follow the same procedure as in insertion, which is to split on the median
• If the parent node underflows, recurse using the same procedure
B-Tree: Deletions
• Remove 26
• Other possibilities?
B Tree: deletions
• Remove 22
B-Tree : deletions
• Remove 18
Identifying m
• The smallest unit that can be read is a block
• Identify the block size
• Identify the size of each record you are storing in a node
• Compute m
Identifying m
• Example:
• Assume block size is 4 KB
• Data stored at each node
• Series of records. Assume it is just the key represented using an integer (4 bytes)
• Pointers to children represented using integers as well
• 2m*4 + (2m+1)*4 = 4*1024
• Therefore, m ≈ 255
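The same computation as a small function, in case the block, key or pointer sizes differ. `max_order` is an illustrative name; it returns the largest m for which a full node (2m keys and 2m+1 child pointers) still fits in one block.

```python
def max_order(block_bytes, key_bytes, pointer_bytes):
    """Largest m with 2m*key_bytes + (2m+1)*pointer_bytes <= block_bytes."""
    return (block_bytes - pointer_bytes) // (2 * (key_bytes + pointer_bytes))


m = max_order(block_bytes=4 * 1024, key_bytes=4, pointer_bytes=4)
print(m)  # 255
```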
Weakness of B-Tree
• We assume each record is just the key itself
• Often not a realistic assumption
• Records of students with roll no. as the primary key
• Addresses sorted by zip codes
• What happens when a record is more than a key?
• Reduces branching factor
• Solution?
• Store pointers
• Extra storage for pointers to records
• Movement of pointers on deletes and inserts
• Store all records at leaf nodes. Inner nodes only guide the search procedure.
• B+ Tree
B+ tree
• Similar structure to a B-Tree, with the following differences
• All data keys reside in leaves
• Pointers in leaves point to the data records
• Non-leaf nodes point to children and do not correspond to any data
• Pointers link sibling leaves for direct access instead of going through the parent
B+ tree
m=2

Only to guide search

All data
B+ Tree: Inserts
• Same as B Tree. However,
• All inserts happen at the leaf level and their impact is propagated
upwards.
• A key at the leaf level is never removed: when a leaf splits, the separator is copied to the parent rather than moved
• Insert 28
B+ Tree Inserts
• Insert 70
B+ tree inserts
• Insert 95
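A minimal in-memory sketch of B+ tree insertion and a leaf-level range scan, assuming distinct keys and storing only keys (no record pointers). The convention used here, that a separator is a copy of the largest key in its left subtree, follows the "left subtree at most as large as k" rule stated for B-trees and is an assumption about the slides' figures. `Leaf`, `Inner` and `BPlusTree` are illustrative names.

```python
import bisect


class Leaf:
    def __init__(self, keys=None):
        self.keys = keys or []
        self.next = None            # pointer to the right sibling leaf


class Inner:
    def __init__(self, keys, children):
        self.keys = keys            # separators: only guide the search
        self.children = children


class BPlusTree:
    def __init__(self, m):
        self.m = m                  # at most 2m keys per node
        self.root = Leaf()

    def _find_leaf(self, key):
        node = self.root
        while isinstance(node, Inner):
            node = node.children[bisect.bisect_left(node.keys, key)]
        return node

    def insert(self, key):
        split = self._insert(self.root, key)
        if split is not None:
            sep, right = split
            self.root = Inner([sep], [self.root, right])

    def _insert(self, node, key):
        if isinstance(node, Leaf):
            bisect.insort(node.keys, key)
            if len(node.keys) <= 2 * self.m:
                return None
            # Leaf overflow: split, keep every key at leaf level,
            # and COPY the last key of the left half upward.
            mid = self.m + 1
            right = Leaf(node.keys[mid:])
            node.keys = node.keys[:mid]
            right.next, node.next = node.next, right
            return node.keys[-1], right
        i = bisect.bisect_left(node.keys, key)
        split = self._insert(node.children[i], key)
        if split is not None:
            sep, right = split
            node.keys.insert(i, sep)
            node.children.insert(i + 1, right)
        if len(node.keys) <= 2 * self.m:
            return None
        # Inner overflow: the median separator MOVES up, as in a B-tree.
        mid = self.m
        sep = node.keys[mid]
        right = Inner(node.keys[mid + 1:], node.children[mid + 1:])
        node.keys = node.keys[:mid]
        node.children = node.children[:mid + 1]
        return sep, right

    def range_scan(self, lo, hi):
        """Keys in [lo, hi]: one descent, then follow sibling pointers."""
        leaf = self._find_leaf(lo)
        out = []
        while leaf is not None:
            for k in leaf.keys:
                if k > hi:
                    return out
                if k >= lo:
                    out.append(k)
            leaf = leaf.next
        return out


tree = BPlusTree(m=2)
for key in [25, 50, 60, 65, 75, 80, 85, 28, 70, 95]:
    tree.insert(key)
print(tree.range_scan(28, 75))
```

The range scan never returns to an inner node after the first descent, which is the point of linking the leaves.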
B+ tree deletes
• Again, same procedure as B-Tree. However,
• a delete is always meant for a key (which is a leaf level value).
• May need to re-write the “search keys” in higher internal nodes
B+ Tree deletes
• Delete 70
B+ Tree deletes
• Delete 25
B+ Tree deletes

• Delete 60

[Figure: B+ tree for the Delete 60 example; node keys shown: 28 50 65 85 and 65 75 80]
Announcements
• Deadline for Assignment 2 extended by 1 day
• Please provide Mid-sem feedback
• Available in Moodle page for COL 362
What were the assumptions made by B-Tree or B+ Tree?
• Data is 1D
• Disk resident
• Lots of updates
• Range queries or point queries
General Setting
Exact match query or Point Query
Point query based on distance function
Similarity Search
Similarity Search: Example
Application
• Top-k search
• Google search
• Range Search
• Apartments within my budget
• Computationally, which one is harder?
• Range query: O(n)
• Top-k: O(nlog k)
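Without an index, both query types reduce to a linear scan. The sketch below shows where the O(n) and O(n log k) bounds come from, using a heap of size k for top-k. The sample points are the city coordinates used later in these slides; the query point, radius and function names are illustrative assumptions.

```python
import heapq
import math

points = [(35, 42), (52, 10), (62, 77), (82, 65),
          (5, 45), (27, 35), (85, 15), (90, 5)]


def range_query(points, query, radius, dist):
    """O(n): one distance computation per point."""
    return [p for p in points if dist(p, query) <= radius]


def top_k(points, query, k, dist):
    """O(n log k): keep the k best seen so far in a max-heap of size k."""
    heap = []                       # max-heap emulated with negated distances
    for p in points:
        d = dist(p, query)
        if len(heap) < k:
            heapq.heappush(heap, (-d, p))
        elif d < -heap[0][0]:       # closer than the current k-th best
            heapq.heapreplace(heap, (-d, p))
    # Sort the k survivors from nearest to farthest.
    return [p for _, p in sorted(heap, reverse=True)]


query = (50, 40)
print(range_query(points, query, 31, math.dist))
print(top_k(points, query, 2, math.dist))
```

Top-k is harder because the pruning threshold (the k-th best distance) is not known in advance and changes as the scan proceeds, whereas a range query has a fixed threshold.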
KD Tree
Structure
• At level i, split on dimension (i mod d)+1
• d is the number of dimensions
• We will assume 2D, but as you can see it is easy to generalize to higher dimensions

Object x,y
Chicago 35,42
Mobile 52,10
Toronto 62,77
Buffalo 82,65
Denver 5,45
Omaha 27,35
Atlanta 85,15
Miami 90,5
Properties
• Each subtree represents a grid
• Not balanced
• Bulk-load to create balanced KD tree
• Split on median
• Storage
• O(n)
• Construction time
• For bulk-loading: O(n log n) to presort points in each dimension
• Depth is O(log n). Thus, inserting all points takes O(n log n)
Searching
• Point Query
• Choose the branch with the
possibility of containing the
point
• Cost
• O(log n)
• Range Search
• Identify all regions (or subtrees) that intersect or lie within the query region
Range Search algorithm
• Start from the root. For each child, check
• If completely within query region, report entire subtree
• If intersects query region, recurse
• Otherwise, ignore this child
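The three cases above can be sketched on a bulk-loaded 2D KD tree as follows. This is a sketch under stated assumptions: points are stored at every node (so the node's own point is tested separately), regions and queries are tuples `(x_lo, x_hi, y_lo, y_hi)`, and the build re-sorts at every level for brevity, which costs O(n log² n) rather than the presorted O(n log n). All names are illustrative.

```python
import math

points = [(35, 42), (52, 10), (62, 77), (82, 65),
          (5, 45), (27, 35), (85, 15), (90, 5)]


class KDNode:
    def __init__(self, point, axis, left, right):
        self.point, self.axis = point, axis
        self.left, self.right = left, right


def build(pts, depth=0):
    """Bulk-load by splitting on the median of the current dimension."""
    if not pts:
        return None
    axis = depth % 2
    pts = sorted(pts, key=lambda p: p[axis])
    mid = len(pts) // 2
    return KDNode(pts[mid], axis,
                  build(pts[:mid], depth + 1),
                  build(pts[mid + 1:], depth + 1))


def contains(q, r):
    """Region r lies completely within query q."""
    return q[0] <= r[0] and r[1] <= q[1] and q[2] <= r[2] and r[3] <= q[3]


def intersects(q, r):
    return q[0] <= r[1] and r[0] <= q[1] and q[2] <= r[3] and r[2] <= q[3]


def report(node, out):
    """Report an entire subtree without further tests: O(size of subtree)."""
    if node is not None:
        out.append(node.point)
        report(node.left, out)
        report(node.right, out)


def range_search(node, q, region=(-math.inf, math.inf, -math.inf, math.inf),
                 out=None):
    if out is None:
        out = []
    if node is None or not intersects(q, region):
        return out                      # ignore this child
    if contains(q, region):
        report(node, out)               # completely inside: report subtree
        return out
    x, y = node.point                   # partial overlap: test point, recurse
    if q[0] <= x <= q[1] and q[2] <= y <= q[3]:
        out.append(node.point)
    split = node.point[node.axis]
    left_region, right_region = list(region), list(region)
    left_region[2 * node.axis + 1] = split    # left child: coordinate <= split
    right_region[2 * node.axis] = split       # right child: coordinate >= split
    range_search(node.left, q, tuple(left_region), out)
    range_search(node.right, q, tuple(right_region), out)
    return out


root = build(points)
print(sorted(range_search(root, (20, 60, 30, 50))))
```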
Range Search: Computation Cost
• Two situations
• Subtree entirely within query region
• Cost?
• O(k), k is size of subtree
• Subtree intersects query region
• Cost?
• How many regions will be intersected?
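Number of intersections
• Let Q(n) be the number of regions of a balanced 2D KD tree on n points that a vertical (or horizontal) line intersects
• After two levels of splitting there are four subregions with n/4 points each, and the line crosses only two of them: Q(n) = 2 + 2Q(n/4)
• Solving the recurrence gives Q(n) = O(√n)
Computation Cost
• A rectangular query has four boundary edges, so it intersects O(√n) regions
• Reporting the k points in the answer costs O(k)
• Total: O(√n + k) in 2D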
Range Trees
Kd-trees, which were described in the previous section, have O(√n + k) query time in 2D. So when the
number of reported points is small, the query time is relatively high. In this section we shall describe
another data structure for rectangular range queries, the range tree, which has a better query time,
namely O(log² n + k). The price we have to pay for this improvement is an increase in storage
from O(n) for kd-trees to O(n log n) for range trees.
Structure
• Build a binary search tree (BST) on dimension x with following
changes
• All data points stay at leaves
• Leaves are doubly linked
• Each node n in the BST points to another BST on y
• This second BST on y is built only on the values residing in the subtree
rooted at n
Example
Object x, y

Chicago 35,42

Mobile 52,10

Toronto 62,77

Buffalo 82,65

Denver 5,45

Omaha 27,35

Atlanta 85,15

Miami 90,5
Properties
• Storage
• Size of BST(x): 2n-1=O(n)
• At each level of BST(X), all data points reside in some corresponding BST(Y).
This means a single data point is stored in h different BSTs on Y, where h is
the height of BST(X). Since h=logn, combined storage of all BST(Y)=O(nlogn).
• Therefore, O(nlogn)
• Cost of building the tree
• Sort points based on X and Y dimensions: O(nlogn)
• Building a BST on sorted points of size n: O(n)
• Building BST(X)=O(n)
• For each node in BST(X), build another BST(y). Like in storage analysis, the
combined size of all BST(y) corresponding to a level is O(n). Since height is
logn, construction time of all BST(y)=O(nlogn)
• Total: O(nlogn)
Range Query
• Basic idea: two level
search
• Identify region within x
boundary
• Then, for that subset of
points, identify those
within y boundary
Range Query
• Identify query region grid (Lx,Rx,Ty,By)
• Search in BST(x) and identify the leftmost leaf and rightmost leaf that cover the range [Lx, Rx]
• Identify the least common ancestor Q of these two leaves
• Search the BST(y) of every node hanging to the right of the path from Q to Lx and to the left of the path from Q to Rx
Example
• Which BST(y)s searched for the
shown Lx and Rx?
• A,B,D,E,F,H
Range Query: Cost
Let P be a set of n points in the plane. A range tree for P uses O(n log n) storage and can be
constructed in O(n log n) time. By querying this range tree one can report the points in P that lie
in a rectangular query range in O(log² n + k) time, where k is the number of reported points.
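A compact sketch of a 2D range tree, with two simplifications that are assumptions of this example rather than the slides' exact structure: each node keeps a y-sorted array instead of a BST on y (binary search on the array plays the same role), and the query recurses from the root instead of explicitly walking the two paths below the least common ancestor. It visits the same O(log n) fully covered subtrees. The build re-sorts at every node for brevity, so it is O(n log² n) here. Names are illustrative, and a non-empty point set is assumed.

```python
import bisect

points = [(35, 42), (52, 10), (62, 77), (82, 65),
          (5, 45), (27, 35), (85, 15), (90, 5)]


class RTNode:
    def __init__(self, pts_by_x):
        # pts_by_x: non-empty list of points sorted by x
        self.min_x = pts_by_x[0][0]
        self.max_x = pts_by_x[-1][0]
        # Secondary structure: the points of this subtree, sorted by y.
        self.by_y = sorted(pts_by_x, key=lambda p: p[1])
        self.ys = [p[1] for p in self.by_y]
        if len(pts_by_x) == 1:
            self.left = self.right = None       # leaf: one data point
        else:
            mid = len(pts_by_x) // 2
            self.left = RTNode(pts_by_x[:mid])
            self.right = RTNode(pts_by_x[mid:])


def build(pts):
    return RTNode(sorted(pts))


def query_y(node, y_lo, y_hi):
    """Second-level search: O(log n + k') on the y-sorted array."""
    i = bisect.bisect_left(node.ys, y_lo)
    j = bisect.bisect_right(node.ys, y_hi)
    return node.by_y[i:j]


def range_query(node, x_lo, x_hi, y_lo, y_hi):
    if node is None or node.max_x < x_lo or node.min_x > x_hi:
        return []                                # disjoint in x
    if x_lo <= node.min_x and node.max_x <= x_hi:
        return query_y(node, y_lo, y_hi)         # fully covered in x
    return (range_query(node.left, x_lo, x_hi, y_lo, y_hi) +
            range_query(node.right, x_lo, x_hi, y_lo, y_hi))


def stored_points(node):
    """Total number of entries over all secondary structures."""
    if node is None:
        return 0
    return len(node.by_y) + stored_points(node.left) + stored_points(node.right)


root = build(points)
print(sorted(range_query(root, 20, 60, 30, 50)))
print(stored_points(root))  # 8 points on each of 4 levels = 32
```

The `stored_points` count makes the O(n log n) storage bound concrete: every point appears once per level of the x-tree.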
Generalizing to higher dimensions
Fractional Cascading
Can we do it faster?
Why is this relevant?
• We are doing searches on subsets O(logn) times.
• Use Fractional Cascading to speed things up
Range Search with fractional cascading:
Structure
• Instead of storing BST(y) at each node, store a sorted array on
Y
• At xsplit, do binary search to find ybottom
• Identify subtrees between x1 and x2
• Simply follow pointers from parent to identify beyond ybottom
• Follow array till you hit ytop
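The pointer-following step can be illustrated on a single parent/child pair of y-sorted arrays. Because the child array holds a subset of the parent's values, the position of ybottom in the child can be read off a precomputed pointer instead of running another binary search. This sketch covers only that core idea, not a full layered range tree; `link` and the sample arrays are illustrative.

```python
import bisect


def link(parent, child):
    """ptr[i] = index of the first child value >= parent[i].

    Both arrays are sorted and child is a subset of parent.
    ptr has one extra entry for "past the end of parent".
    """
    ptr = []
    j = 0
    for value in parent:
        while j < len(child) and child[j] < value:
            j += 1
        ptr.append(j)
    ptr.append(len(child))
    return ptr


# y-values of all eight cities, and of the four with the smallest x.
parent = [5, 10, 15, 35, 42, 45, 65, 77]
child = [10, 35, 42, 45]
ptr = link(parent, child)

y_bottom = 30
i = bisect.bisect_left(parent, y_bottom)   # the only binary search
j = ptr[i]                                 # O(1) hop into the child array
print(parent[i], child[j])
```

With such pointers stored at every node, one binary search at the split node replaces the O(log n) separate searches, which is where the O(log n + k) bound comes from.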
Range Search with fractional cascading:
Structure
Cost
• O(logn+k)
One more example
Distances
Top-k Query: Best-first algorithm
• Two Heaps
• Answer Set: Max Heap of size k
• Candidates MBRs (Minimum bounding rectangle): Min Heap
• Initially contains just the root
• Algorithm
• Pop top MBR from Candidate
• If the MBR is farther from the query than the top node in Answer Set (and Answer Set already holds k points), return
• Else
• If the MBR is a data point and closer to the query than the top node in Answer Set, insert it in Answer Set
• Otherwise, insert those children of the MBR into Candidate that are closer to the query point than the top node in Answer Set
• Iterate as long as Candidate is not empty and the top MBR in Candidate is not farther than the top data point in Answer Set
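A sketch of the two-heap best-first search, under several assumptions made for this example: the index is a simple bulk-loaded binary tree of MBRs (not necessarily the structure shown on the slides), leaves hold up to two points each rather than treating every point as its own MBR, and the distance from the query to an MBR is the minimum Euclidean distance to the rectangle. `MBRNode`, `min_dist` and `top_k` are illustrative names.

```python
import heapq
import itertools
import math

points = [(35, 42), (52, 10), (62, 77), (82, 65),
          (5, 45), (27, 35), (85, 15), (90, 5)]


class MBRNode:
    def __init__(self, pts, depth=0, leaf_size=2):
        xs = [p[0] for p in pts]
        ys = [p[1] for p in pts]
        self.mbr = (min(xs), max(xs), min(ys), max(ys))
        self.points = pts if len(pts) <= leaf_size else None
        self.children = []
        if self.points is None:
            pts = sorted(pts, key=lambda p: p[depth % 2])
            mid = len(pts) // 2
            self.children = [MBRNode(pts[:mid], depth + 1, leaf_size),
                             MBRNode(pts[mid:], depth + 1, leaf_size)]


def min_dist(q, mbr):
    """Smallest possible distance from q to any point inside the MBR."""
    dx = max(mbr[0] - q[0], 0, q[0] - mbr[1])
    dy = max(mbr[2] - q[1], 0, q[1] - mbr[3])
    return math.hypot(dx, dy)


def top_k(root, q, k):
    answers = []                    # max-heap of size k: (-distance, point)
    tie = itertools.count()         # tie-breaker so nodes are never compared
    candidates = [(min_dist(q, root.mbr), next(tie), root)]   # min-heap
    while candidates:
        d, _, node = heapq.heappop(candidates)
        if len(answers) == k and d > -answers[0][0]:
            break                   # nothing left can beat the k-th answer
        if node.points is not None:             # leaf: examine data points
            for p in node.points:
                dp = math.dist(p, q)
                if len(answers) < k:
                    heapq.heappush(answers, (-dp, p))
                elif dp < -answers[0][0]:
                    heapq.heapreplace(answers, (-dp, p))
        else:                                   # inner node: expand children
            for child in node.children:
                dc = min_dist(q, child.mbr)
                if len(answers) < k or dc <= -answers[0][0]:
                    heapq.heappush(candidates, (dc, next(tie), child))
    return [p for _, p in sorted(answers, reverse=True)]


root = MBRNode(points)
print(top_k(root, (50, 40), 3))
```

Popping candidates in order of minimum distance is what allows the early exit: once the nearest remaining MBR is farther than the current k-th answer, every other candidate is too.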
