IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 1, JANUARY 2010
Abstract—Massive Radio Frequency Identification (RFID) data sets are expected to become commonplace in supply chain
management systems. Warehousing and mining this data is an essential problem with great potential benefits for inventory
management, object tracking, and product procurement processes. Since RFID tags can be used to identify each individual item,
enormous amounts of location-tracking data are generated. With such data, object movements can be modeled by movement graphs,
where nodes correspond to locations and edges record the history of item transitions between locations. In this study, we develop a
movement graph model as a compact representation of RFID data sets. Since spatiotemporal as well as item information can be
associated with the objects in such a model, the movement graph can be huge, complex, and multidimensional in nature. We show that
such a graph can be better organized around gateway nodes, which serve as bridges connecting different regions of the movement
graph. A graph-based object movement cube can be constructed by merging and collapsing nodes and edges according to an
application-oriented topological structure. Moreover, we propose an efficient cubing algorithm that performs simultaneous aggregation
of both spatiotemporal and item dimensions on a partitioned movement graph, guided by such a topological structure.
1 INTRODUCTION
separate In- and Out-gateways by matching incoming and outgoing edges carrying the same subset of items into the
corresponding single-direction traffic gateways.
Notice that gateways may naturally form hierarchies. For example, one may see a hierarchy of gateways, e.g., country
level sea ports → region level distribution centers → state level hubs.
5 DATA COMPRESSION
5.1 Redundancy Elimination Compression
RFID data contains large amounts of redundancy. Each reader scans for items at periodic intervals, and thus generates
hundreds or even thousands of duplicate readings for items in its range that are not moving. For example, if a pallet
stays at a warehouse for 7 days, and the reader scans for items every 30 seconds, there will be 20,160 readings of the
form (EPC, warehouse, time). We could compress all these readings, without loss of information, to a single tuple of
the form (EPC, warehouse, time_in, time_out), where time_in is the first time that the EPC was detected in the
warehouse and time_out the last one.
Redundancy elimination can be accomplished by sorting the raw data on EPC and time, and generating time_in and
time_out for each location by merging consecutive records for the same object staying at the same location.
5.2 Bulky Movement Compression
Since a large number of items travel and stay together through several stages, it is important to represent such a
collective movement by a single record no matter how many items were originally collected. As an example, if
1,000 boxes of milk stayed in location locA between time t1 (time_in) and t2 (time_out), it would be advantageous if
only one record is registered in the database rather than 1,000 individual RFID records. The record would have the
form (gid; prod; locA; t1; t2; 1,000), where 1,000 is the count, prod is the product id, and gid is a generalized id
which will not point to the 1,000 original EPCs but instead point to the set of new gids which the current set of
objects move to. For example, if this current set of objects were split into 10 partitions, each moving to one distinct
location, gid will point to 10 distinct new gids, each representing a record. The process iterates until the end of the
object movement, where the concrete EPCs will be registered. By doing so, no information is lost but the number of
records needed to store such information is substantially reduced.
The process of selecting the most efficient grouping for items, both in terms of compression and query processing,
depends on the movement graph topology.
5.2.1 Split-Only Model
In some applications, the movement graph presents a tree-like structure, with a few factories near the root,
warehouses and distribution centers in the middle, and a large number of individual stores at the leaves. In such a
topology, it is common to observe items moving in large groups near the factories and splitting into smaller groups as
they approach individual stores. We say that movement graphs with this topology present a Split-Only model of
object movements.
In the Split-Only model, we can gain significant compression by creating a hierarchy of gids, rooted at factories
where items move in the largest possible groups, and pointing to successively smaller groups as items move down the
supply chain. In this model, a single grouping schema provides good compression because the basic groups, in which
objects move, are preserved throughout the different locations, i.e., the smallest groups that reach the stores are
never shuffled, but are preserved all the way from the Factory. In the next section, we will present a more general
model that can accommodate both splitting and merging of groups.
5.2.2 Merge-Split Model
A more complex model of object movements is observed in a global supply chain operation, where items may merge
and split, and groups of items can be shuffled several times. One such case is when items move between exporting
and importing countries. In the exporting country, items merge into successively larger groups on their way from
factories to logistic centers, and finally to large shipping ports. In the importing country, the process is usually
reversed: items split into successively smaller groups as they move from the incoming port, to distribution centers,
and all the way to individual stores. We say that movement graphs with this topology present a Merge-Split model of
object movements.
A single object grouping model, such as the one used in a Split-Only model, would not be optimal when groups of
items can both split and merge. A better option is to partition the movement graph around gateways, and define an
item grouping model at the partition level. For example, the exporting country would get a hierarchy of groups
rooted at the port and ending at the factories, while the importing country would have a separate hierarchy rooted at
the port and ending at the individual stores. Using a single grouping for both partitions has the problem that each
group would have to point to many small subgroups, or even just individual items, that are preserved throughout
the entire supply chain after multiple operations of merge, split, and shuffle. Separate groupings prevent this problem
by requiring bulky movement only at the partition level, and allowing for merge, split, and even shuffling of items
without loss of compression.
5.3 Data Generalization
Since many users are only interested in data at a relatively high abstraction level, data compression can be explored
to group, merge, and compress data records. This type of compression, as opposed to the previous two compression
methods, is lossy, because once we aggregate the data at a high level of abstraction, e.g., time aggregated from second
to hour, we cannot ask queries for any level below the aggregated one.
There are two types of data generalization: item-based, which is the same as that encountered in traditional data
cubes and does not involve spatiotemporal dimensions; and path-based, which is unique to RFID data sets.
Path-Level Generalization. A new type of data generalization, not present in traditional data cubes, is that of
merging and collapsing path stages according to time and location concept hierarchies. For example, if the minimal
GONZALEZ ET AL.: MODELING MASSIVE RFID DATA SETS: A GATEWAY-BASED MOVEMENT GRAPH APPROACH 95
granularity of time is hour, then objects moving within the same hour can be seen as moving together and be merged
into one movement. Similarly, if the granularity of the location is shelf, objects moving to the different layers of a
shelf can be seen as moving to the same shelf and be merged into one.
Another type of path generalization is that of expanding different types of locations to different levels of abstraction,
depending on the analysis task. For example, a transportation manager may want to collapse all movements inside
stores and warehouses while expanding movements within trucks and transportation centers to a very detailed level.
On the other hand, store managers may want to collapse all object movements outside their particular stores.
An important difference between path level generalization and the more conventional data cube generalization
along concept hierarchies is that in path level aggregation, we need to preserve the path structure of the data, i.e., we
need to make sure that the new times, locations, and transitions are consistent with the original data.
6 MOVEMENT GRAPH PARTITIONING
In this section, we discuss the methods for identifying gateways, partitioning based on the movement graph, and
associating partitions to gateways.
6.1 Gateway Identification
In many applications, it is possible for data analysts to provide the system with the complete list of gateways; this
is realistic in a typical supply chain application where the set of transportation ports is well known in advance, e.g.,
Walmart knows all the major ports connecting its suppliers in Asia to the entry ports in the United States. In some
other cases, we need to discover gateways automatically. We can use existing graph partitioning techniques, such as
balanced minimum cut or average minimum cut [9], to find a small set of edges that can be removed from the graph so
that the graph is split into two disconnected components; such edges will typically be associated with the strong
traffic edges of In- or Out-gateways. Gateways could also be identified by using the concept of betweenness and
centrality in social network analysis, as they will correspond to nodes with high betweenness as defined in [13], and
we can use an efficient algorithm such as [5] to find them.
Here, we propose a simple but effective approach to discover gateways that works well for typical supply chain
operations where gateways have strong characteristics that are easy to identify. We can take a movement graph, and
rank nodes as potential gateways based on the following observations: 1) a large volume of traffic goes through
gateway nodes, 2) gateways carry unbalanced traffic, i.e., incoming and outgoing edges carrying the same tags but
having very different average shipment sizes, and 3) gateways split paths into largely disjoint sets of nodes that only
communicate through the gateway. The algorithm can find gateways by first eliminating low-traffic nodes and then
the nodes with balanced traffic, i.e., checking the number of incoming and outgoing edges, and the ratio of the
average incoming (outgoing) shipment sizes versus the average of the outgoing (incoming) shipment sizes. Finally,
for the edges that pass the above two filters, check which locations split the paths going through the location into two
largely disjoint sets; that is, the locations in paths involving the gateway can be split into two subsets, locations
occurring in the path before the gateway and those occurring in the path after the gateway.
6.2 Partitioning Algorithm
The movement graph partitioning problem can be framed as a traditional graph clustering problem, and we could
use techniques such as spectral clustering [9], [20]. But for the specific problem of partitioning supply chain
movement graphs, we can design a less costly algorithm that takes advantage of the topology of the graph to associate
locations to those gateways to which they are most strongly connected.
The key idea behind the partitioning algorithm is that in the movement graph for a typical supply chain application,
locations only connect directly (without going through another gateway) to a few gateway nodes. That is, very few
items in Europe reach the major ports in the United States without first having gone through Europe's main shipping
ports. Using this idea, we can associate each location to the set of gateways that it directly reaches (we use a
frequency threshold to filter out gateways that are reached only rarely). When two locations $l_i$ and $l_j$ have a
gateway in common, we merge their groups into a single partition containing the two locations and all their associated
gateways. We repeat this process until no additional merge is possible. At the end, we do a postprocessing step where
we associate very small partitions to the larger partition to which they most frequently connect directly.
Analysis. Algorithm 1 presents the details of movement graph partitioning given a set of gateways. In a single scan
of the path database, we compute statistics on the traffic from each node to the different gateways. We then go
through the list of locations, merging sets of locations that share common gateways. Finally, we merge small clusters
into larger ones. This algorithm scales linearly with the size of the path database, linearly with the number of nodes in
the movement graph, and quadratically with the number of gateways in the movement graph. We can further speed up
the algorithm by running it on a random sample of the original database instead of running it on the full data. This
is possible because the structure of the supply chain is usually fairly stable over time, and a representative random
sample is enough to capture the topology of the graph.
Algorithm 1 Movement graph partitioning
Input: $G(V, E)$: a movement graph, $W \subseteq V$: the set of gateways, $D$: a path database, min_nodes: min. # of
vertices per partition, min_connectivity: min. # of paths to gateway.
Output: A partition of $V$ into $V_1, \ldots, V_k$ s.t. $V_i \cap V_j = \emptyset, \forall i \neq j$.
Method
1: Let C be a connection matrix, with entries $C[l_i, g_j]$ that indicate the number of times that location $l_i$
   connects to gateway $g_j$. Initialize every entry in C to 0.
2: for each path p in D' do
3:   for each location l in p do
4:     C[l, g] += 1 where g is the next gateway after l in p, or g is the previous gateway before l in p.
5:   end for
6: end for
7: for each location $l \in V$ do
8:   $L_g$ = all gateways g s.t. $C[l, g] >$ min_connectivity
9:   add l to the partition containing all gateways in $L_g$; if the gateways in $L_g$ reside in different partitions,
     merge the partitions and add l to the merged partition; if no partition exists that contains any element in $L_g$,
     create a new partition containing $L_g \cup \{l\}$.
10: end for
11: for all partitions $V_i$ s.t. $|V_i| <$ min_nodes do
12:   merge $V_i$ with the partition $V_j$ s.t. the traffic from/to gateways in $V_i$ to/from gateways in $V_j$ is
      maximum; this can be determined using the connection matrix C.
13: end for
14: return partitions $V_1, \ldots, V_k$
6.3 Handling Sporadic Movements: Virtual Gateways
An important property of gateways is that all the traffic leaving or entering a partition goes through them. However,
in reality it is still possible to have small sets of sporadic item movements between partitions that bypass gateways.
Such movements reduce the effectiveness of gateway-based materialization because path queries involving multiple
partitions will need to examine some path segments of the graph unrelated to gateways. This problem can be easily
solved by adding a special virtual gateway to each partition for all outgoing and incoming traffic from and to other
partitions that does not go through a gateway. Virtual gateways guarantee that intergateway path queries can be
resolved by looking at gateway-related traffic only.
For our running example, Fig. 1, we can partition the movement graph along the dotted circles, and associate
gateway G1 with the first partition and gateway G2 with the second one. In this case, we need to create a virtual
gateway Gx to send outgoing traffic from the first partition (i.e., traffic from B) that skips G1, and another virtual
gateway Gy to receive incoming traffic into the second partition (i.e., traffic to I) that skips G2.
7 STORAGE MODEL
With the tremendous amounts of RFID data, it is crucial to study the storage model. We propose to use three data
structures to store both compressed and generalized data: 1) an edge table, storing the list of edges, or alternatively a
stay table, storing the list of nodes, 2) a map table, linking groups of items moving together, and 3) an information
table, registering path-independent information related to the items in the graph.
7.1 Edge Table
This table registers information on the edges of the movement graph. The format is
⟨from, to, t_start, t_end, direct, gid_list : measure_list⟩, where from is the originating node, to is the destination
node, t_start is the time when the items departed the location from, t_end is the time when the items arrived at the
location to, direct is a boolean value that is true if the items moved directly between from and to and false if
intermediate nodes were visited, gid_list is the list of items that traveled together, and measure_list is a set of
aggregate functions computed on the items in gid_list while they took the transition, e.g., it can be count, average
temperature, average movement cost, etc. We will elaborate more on the concept of gids in the section describing the
map table.
An alternative to recording edges in the graph is to record nodes and the history of items staying at each node.
The particular representation should match the application needs: if most queries ask about properties of items at a
given location, materializing nodes may be better, and if most queries ask about properties during transition,
materializing edges is better. If appropriate, we can materialize both.
7.2 Map Table
Bulky movement means a large number of items move together. A generalized identifier gid can be assigned to
every group of items that moves together, which will substantially reduce the size of the tag lists at each edge.
When groups of items split into smaller groups, gid (the original group) can be split into a set of children gids,
representing these smaller groups. The map table contains entries of the form
⟨partition, gid, contained_list, contains_list⟩, where partition is the subgraph of the movement graph where this
map is applicable, contained_list is the list of all gids whose item list is a superset of the items in gid, and
contains_list is the list of gids whose item lists are subsets of the items in gid, or a list of individual tags if gid did
not split into smaller groups.
There are two main reasons for using a map table instead of recording the complete EPC lists at each stage: 1) data
compression and 2) query processing efficiency.
Compression: First, we do not want to record each RFID tag on the EPC list for every stay record it participated in.
For example, if we assume that 10,000 items move in the system in groups of 10,000, 1,000, 100, and 10 through four
stages, instead of using 40,000 units of storage for the EPCs in the stay records, we use only 1,111 units² (1,000 for
the last stage and 100, 10, and 1 for the ones before).
² This figure does not include the size of the map itself, which should use 12,221 units of storage, still much smaller
than the full EPC lists.
Since real RFID data sets involve both merge and split movement models, the Split-Only model cannot have much
sharing and compression. Here, we adopt a Merge-Split model, where objects can be merged, shuffled, and split in
many different combinations during transportation. Our mapping table takes a gateway-centered model, where
mapping is centered around gateways, i.e., the largest merged and collective moving sets at the gateways become the
root gids, and their children gids can be spread in both directions along the gateways. This will lead to the maximal
gid sharing and gid_list compression.
Query Processing: The second and more important reason for having such a map table is the efficiency in query
processing. By creating gid lists that are much shorter than EPC lists, we can compute path-related queries very
quickly. To compute, for example, the average duration for milk to move from the distribution center (D), to the
store backroom (B), and finally to the shelf (S), we need to locate the edge records for milk between the stages and
intersect the EPC lists of each. By using the map, the EPC lists can be orders of magnitude shorter, and thus reduce
IO costs.
7.3 Information Table
The information table records other attributes that describe properties of the items traveling through the edges of
the movement graph. The format of the tuples in the information table is ⟨gid_list, D_1, ..., D_n⟩, where gid_list is
the list of items that share the same values on the dimensions D_1 to D_n, and each dimension D_i describes a
property of the items in gid_list. Examples of attributes that may appear in the information table are product,
manufacturer, or weight. Each dimension of the information table can have an associated concept hierarchy, e.g., the
product dimension may have a hierarchy such as EPC → SKU → product → category.
8 MATERIALIZATION STRATEGY
Materialization of path segments in the movement graph may speed up a large number of path-related queries. Since
there is an exponential number of possible path segments that can be precomputed in a large movement graph, it is
only realistic to materialize those path segments that provide the highest expected benefit at a reasonable cost.
We will develop such a strategy here.
8.1 Path Queries
A path query requires the computation of a measure over all the items with a path that matches a given path pattern.
It is of the form q ⟨c_info, path_expression, measure⟩, where c_info is a selection on the information table that
retrieves the relevant items for analysis; path_expression is a sequence of stage conditions on location and time that
should appear in every path, in the given order but possibly with gaps; and measure is a function to be computed on
the matching paths. An example path query may be c_info = {product = beef, sale_date = 2006}, path_expression =
{Argentina farm A, San Mateo store S}, and measure = average temperature, which asks for the average temperature
recorded for each beef package traveling from a certain farm in Argentina to a particular store in San Mateo.
There may be many strategies to answer a path query, but, in general, we will need to retrieve the appropriate tag
lists and measures for the edges along the paths involving the locations in the path expression; retrieve the tag list for
the items matching the info selection; intersect the lists to get the set of relevant tags; and finally, if needed, retrieve
the relevant paths to compute the measure.
8.2 Path Segment Materialization
We can model path segments as indirect edges in the movement graph. For example, if we want to precompute the
list of items moving from location $l_i$ to location $l_j$ through any possible path, we can materialize an edge from
$l_i$ to $l_j$ that records a history of all tag movements between the nodes, including movements that involve an
arbitrary number of intermediate locations. Indirect edges are stored in the same format as direct ones, but with the
flag direct set to false.
The benefit of materializing a given indirect edge in the movement graph is proportional to the number of queries for
which this edge reduces the total processing cost. Indirect edges involved in a path query reduce the number of
edges that need to be analyzed and provide shorter tag lists that are faster to retrieve and intersect. In order for an
indirect edge to help a large number of queries, it should have three properties: 1) carry a large volume of traffic,
2) be part of a large portion of all the paths going from nodes in one partition of the graph to nodes in any other
partition, and 3) be involved directly or indirectly in a large number of path queries. The sets of edges that best match
these characteristics are the following.
8.2.1 Node-to-Gateway
In supply chain implementations, it is common to find a few well-defined Out-gateways that carry most of the traffic
leaving a partition of the graph where items are produced, before reaching a partition of the graph where items are
consumed. For example, products manufactured in China and destined for export to the United States leave the
country through a set of ports. We propose to materialize the (virtual) edges from every node to the Out-gateways that
it first reaches. Such materialization, for example, would allow us to quickly determine the properties of shipments
originating at any location inside China and leaving the country.
8.2.2 Gateway-to-Node
Another set of important nodes for indirect edge materialization are In-gateways, as most of the item traffic entering
a region of the graph where items are consumed has to go through an In-gateway. For example, imported products
coming into the United States all arrive through a set of major ports. When we need to determine which items sold
in the United States have paths that involve locations in foreign countries, we can easily get this information by
precomputing the list of items that arrived at the location from each of the In-gateways. We propose to materialize all
the (virtual) edges from an In-gateway to the nodes that it reaches without passing through any other gateway.
8.2.3 Gateway-to-Gateway
Another interesting set of indirect edges to materialize are the ones carrying inter-gateway traffic. For example, we
want to precompute which items leaving the Shanghai port finally arrive at the New York port. The benefit of such an
indirect edge is twofold: first, it aggregates a large number of possible paths between two gateways and precomputes
important measures on the shipments; and second, it allows us to quickly determine which items travel between
partitions.
Lemma 8.1. A movement graph with k partitions, $p_1, \ldots, p_k$, each partition i with $pn_i$ nodes and $pg_i$
gateways, will require the materialization of a number of indirect edges that is bounded by
$\sum_{i=1}^{k} (pn_i \times pg_i) + \sum_{i \neq j} pg_i \times pg_j$.
Proof. For each node in each partition we will materialize at most $pg_i$ indirect edges; this is when the node has
traffic to or from every gateway, the maximum number of node
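The bound stated in Lemma 8.1 is easy to evaluate for a concrete partitioning. The Python sketch below is illustrative only: the function name `indirect_edge_bound` and the input layout (a list of `(pn_i, pg_i)` pairs) are assumptions made for this example, and the second sum is read as ranging over ordered pairs i ≠ j, so each direction between two partitions is counted separately.

```python
from typing import List, Tuple


def indirect_edge_bound(partitions: List[Tuple[int, int]]) -> int:
    """Evaluate the Lemma 8.1 bound on the number of materialized indirect edges.

    Each entry is (pn_i, pg_i): the node count and gateway count of partition i.
    """
    # First term: at most pn_i * pg_i node/gateway edges inside each partition.
    within = sum(pn * pg for pn, pg in partitions)
    # Second term: pg_i * pg_j summed over ordered pairs i != j
    # (assumption: both directions between two partitions are counted).
    gateways = [pg for _, pg in partitions]
    total_gateways = sum(gateways)
    across = sum(pg * (total_gateways - pg) for pg in gateways)
    return within + across


# Hypothetical example: two partitions with (10 nodes, 2 gateways) and (20 nodes, 3 gateways).
bound = indirect_edge_bound([(10, 2), (20, 3)])
print(bound)
```

For these made-up sizes the first term contributes 10·2 + 20·3 = 80 and the second 2·3 + 3·2 = 12, so the bound grows linearly in the number of nodes per partition rather than with the number of possible path segments.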
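The general strategy for answering a path query described in Section 8.1 (retrieve edge tag lists, retrieve the tags matching the info selection, intersect, then compute the measure) can be sketched as follows. This is a minimal sketch under stated assumptions, not the system's implementation: the dictionaries `edge_tags`, `info`, and `paths`, the toy values in them, and the function names are all hypothetical, and stage conditions are reduced to location names, with each consecutive pair looked up as a (possibly indirect) edge.

```python
from statistics import mean
from typing import Dict, List, Set, Tuple

Path = List[Tuple[str, float]]  # (location, temperature) stages in visiting order

# Hypothetical toy data; none of these values come from the paper.
edge_tags: Dict[Tuple[str, str], Set[str]] = {
    ("farm_A", "store_S"): {"t1", "t2"},  # a materialized indirect edge
}
info: Dict[str, Dict[str, str]] = {
    "t1": {"product": "beef"},
    "t2": {"product": "milk"},
    "t3": {"product": "beef"},
}
paths: Dict[str, Path] = {
    "t1": [("farm_A", 4.0), ("truck", 6.0), ("store_S", 5.0)],
    "t2": [("farm_A", 3.0), ("truck", 3.0), ("store_S", 3.0)],
    "t3": [("farm_B", 7.0), ("store_S", 7.0)],
}


def average_temperature(matching: List[Path]) -> float:
    """Mean of all temperature readings on the matching paths."""
    return mean(temp for path in matching for _, temp in path)


def path_query(c_info, path_expression, measure):
    # 1) Tag lists of the (possibly indirect) edges between consecutive stages.
    edge_lists = [edge_tags.get(pair, set())
                  for pair in zip(path_expression, path_expression[1:])]
    # 2) Tags whose information-table attributes satisfy the selection.
    selected = {tag for tag, attrs in info.items()
                if all(attrs.get(k) == v for k, v in c_info.items())}
    # 3) Intersect the lists to obtain the relevant tags.
    relevant = selected.intersection(*edge_lists)
    # 4) Retrieve the relevant paths and compute the measure on them.
    value = measure([paths[t] for t in sorted(relevant)]) if relevant else None
    return relevant, value


tags, avg = path_query({"product": "beef"}, ["farm_A", "store_S"], average_temperature)
```

In this toy setting only tag `t1` is both beef and on the farm-to-store edge, so the measure is computed over a single path. Shorter gid lists in place of the tag sets would shrink the inputs to step 3, which is the saving the map table is meant to provide.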
remove all transportation-related locations, a new edge cells only once. The size of the full movement graph
(F actory, Store) will be created, with all the edges to and cube is thus the total number of distinct cells in all the
from transportation locations removed. RFID-cuboids. When materializing the cube or a subset
Graph aggregation involves different operations over the of RFID-cuboids, we compute all relevant cells to those
gid lists at each edge, when we remove nodes we need to RFID-cuboids without duplicating shared cells between
intersect gid lists to determine the items traveling through RFID-cuboids.
the new edge, but when we simply aggregate locations to
higher levels of abstraction (without removing them) we 9.3 Cube Computation
need instead to compute the union of the gid lists of several In this section, we introduce an efficient algorithm, that in a
edges. For example, looking at Fig. 4 in order to determine single scan of the fact table, simultaneously computes the
the gid list for the edge (F actory, Store) we need to intersect set of interesting RFID-cuboids, as defined by the user or
the gid lists of all outgoing edges from the node F actory determined through selective cube materialization techni-
with the incoming edges to the node Store; on the other ques such as those proposed in [20]. The computed RFID-
hand, if we aggregate transportation locations to a single cuboids can then be used to answer the most common
node in order to determine the gid list for the edge OLAP and path query operations efficiently, and can also
(T ransportation, Store), we need to union the gid lists of be used to quickly compute nonmaterialized RFID-cuboids
the edges ðHub; StoreÞ and ðW eighting; StoreÞ. on the fly.
Partition-based aggregation. We will aggregate each
9.2 Cube Structure partition of the graph independently, i.e., paths will be
Fact table. The fact table contains information on the divided into disjoint segments according to the partitions
movement graph and the items aggregated to the minimum defined in the movement graph, and each segment in the path
level of abstraction that is interesting for analysis. Each will be aggregated independently, without merging loca-
entry in the fact table is a tuple of the form hfrom, to, t start, tions from separate segments. This technique guarantees
t end, d1 , d2 ; . . . ; dk: gid list: measure listi, where gid list is that for any RFID-cuboid we can still use the gateway-based
list of gids that contains all the items that took the transition materialization to improve query performance. If we need
between from and to locations, starting at time t in and to compute inter-partition aggregation, it can be done at
ending at time t out, and all share the dimension values runtime by using the best available RFID-cuboids from each
d1 ; . . . ; dk for dimensions D1 ; . . . ; Dk in the info table, partition, and computing required aggregation on top of
measure list contains a set of measures computed on the those at runtime.
gids in gid list. Algorithm 2 presents an efficient cubing algorithm that
Measure. For each entry in the fact table we register the does simultaneous aggregation of every interesting cell in
gid list corresponding to the tags that match the dimension parallel, with a single scan of the path database.
values in the entry. We can also record for each gid in the list a set of measures recorded during shipping, such as average temperature, total weight, or count. We can use the gid list to quickly retrieve those paths that match a given slice, dice, or path selection query at any level of abstraction.

When a query is issued for an aggregate measure that is already precomputed in the cube, we do not need to access the path database, and all query processing can be done directly on the aggregated movement graph. For example, if we record count as a measure, any query asking for counts of items moving between locations can be answered directly by retrieving the appropriate cells in the cube. When a query asks for a measure that has not been precomputed in the cube, we can still use the aggregated cells to quickly determine the list of relevant gids and retrieve the corresponding paths to compute the measure on the fly.

RFID Cuboids. A cuboid in the RFID cube resides at a level of abstraction of the location concept hierarchy, a level of abstraction of the time dimension, and a level of abstraction of the info dimensions. Path aggregation is used to collapse uninteresting locations, and item aggregation is used to group related items. Cells in a movement graph RFID-cuboid group both items and edges that share the same values at the RFID-cuboid abstraction level.

It is possible for two separate RFID-cuboids to share a large number of common cells, namely, all those corresponding to portions of the movement graph that are common to both RFID-cuboids and that share the same item aggregation level. A natural optimization is to compute

Path prefix tree. The algorithm first constructs a prefix tree for each partition of the movement graph. For this purpose, paths are divided into disjoint fragments separated by gateways; when locations belonging to separate partitions appear in the same path fragment, we separate them with a virtual gateway. Each path is converted into a sequence of edges of the form (from, to, t_in, t_out). The first edge of every path has a from location equal to a special start-of-path symbol, and the last edge in the path has a to location equal to a special end-of-path symbol. After the prefix trees have been constructed, we assign a unique gid to the items aggregated in each node of the tree that share the same values on all item dimensions. Fig. 5 presents the path prefix tree computed on the path database of Table 1 and the first partition in Fig. 1. Notice that we have created the virtual gateway Gx and added it to the prefix path.

Cuboid materialization. After building the path prefix trees, we are ready to start building the relevant set of RFID-cuboids. This is done by traversing each prefix tree, one branch at a time, and generating all possible edges from the branch, including 1) direct edges that correspond to a single node in the tree, 2) edges generated by performing path collapsing to all interesting location levels of abstraction, and 3) indirect edges from each location in the path to a gateway. All the cells involving these edges are then updated to include the gids associated with the edge, and their measures are adjusted. We also aggregate RFID-cuboids that include only item dimensions (i.e., location and time dimensions are aggregated to all) by aggregating the
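The path prefix tree step can be illustrated with a minimal Python sketch. It is not the paper's implementation: it handles a single partition only and omits fragmentation at gateways and virtual gateways. The names `START`, `END`, `path_to_edges`, `Node`, `build_prefix_tree`, and `assign_gids` are hypothetical, and two points are assumptions: that t_in and t_out describe the stay at the `to` location (so the closing edge carries no times), and that an item is recorded at every node its path traverses.

```python
from dataclasses import dataclass, field
from itertools import count

# Hypothetical stand-ins for the two special start/end symbols.
START, END = "<start>", "<end>"


def path_to_edges(stages):
    """Turn [(location, t_in, t_out), ...] into (from, to, t_in, t_out) edges.

    Assumption: t_in/t_out describe the stay at the `to` location, so the
    closing edge into END carries no times.
    """
    edges, prev = [], START
    for loc, t_in, t_out in stages:
        edges.append((prev, loc, t_in, t_out))
        prev = loc
    edges.append((prev, END, None, None))
    return edges


@dataclass
class Node:
    children: dict = field(default_factory=dict)  # edge -> Node
    items: dict = field(default_factory=dict)     # item-dimension values -> [item ids]
    gids: dict = field(default_factory=dict)      # item-dimension values -> gid


def build_prefix_tree(path_db):
    """Insert every path edge by edge; path_db holds (item_id, item_dims, stages)."""
    root = Node()
    for item_id, item_dims, stages in path_db:
        node = root
        for edge in path_to_edges(stages):
            node = node.children.setdefault(edge, Node())
            # Assumption: an item is recorded at every node its path traverses.
            node.items.setdefault(item_dims, []).append(item_id)
    return root


def assign_gids(root):
    """Give each group of items that shares a node and all item dimensions its own gid."""
    next_gid = count(1)
    stack = [root]
    while stack:
        node = stack.pop()
        for dims in node.items:
            node.gids[dims] = next(next_gid)
        stack.extend(node.children.values())
    return root


# Toy path database (invented for illustration only).
path_db = [
    ("i1", ("milk", "brandA"), [("factory", 0, 5), ("dock", 7, 9)]),
    ("i2", ("milk", "brandA"), [("factory", 0, 5), ("dock", 7, 9)]),
    ("i3", ("shoes", "brandB"), [("factory", 0, 5), ("truck", 6, 8)]),
]
tree = assign_gids(build_prefix_tree(path_db))
```

In this toy database, items i1 and i2 share every edge and every item dimension, so they fall into the same group at each node, while i3 shares only the first edge with them and receives separate gids.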
100 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 1, JANUARY 2010
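The three kinds of edges generated during cuboid materialization (direct edges, edges obtained by path collapsing, and indirect edges to a gateway) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' algorithm: a branch is reduced to a plain list of locations with one gid and an item count, `level_of` is a hypothetical one-level location concept hierarchy, indirect edges are assumed to point to the first gateway further down the branch, and count is the only measure. All function names are hypothetical.

```python
from collections import defaultdict


def direct_edges(branch):
    """Edges between consecutive locations of the branch."""
    return list(zip(branch, branch[1:]))


def collapsed_edges(branch, level_of):
    """Roll locations up with `level_of`, merge consecutive repeats, then link them."""
    rolled = []
    for loc in branch:
        parent = level_of.get(loc, loc)
        if not rolled or rolled[-1] != parent:
            rolled.append(parent)
    return list(zip(rolled, rolled[1:]))


def gateway_edges(branch, gateways):
    """Indirect edges from each location to the first gateway downstream.

    Assumption: only the nearest downstream gateway is linked; adjacent
    pairs are skipped because they already exist as direct edges.
    """
    edges = []
    for i, loc in enumerate(branch):
        for j in range(i + 1, len(branch)):
            if branch[j] in gateways:
                if j > i + 1:
                    edges.append((loc, branch[j]))
                break
    return edges


def materialize(branches, level_of, gateways):
    """Update cells keyed by (cuboid, edge) with gids and a count measure.

    branches: iterable of (locations, gid, item_count).
    """
    cells = defaultdict(lambda: {"gids": set(), "count": 0})
    for locations, gid, n_items in branches:
        base = set(direct_edges(locations)) | set(gateway_edges(locations, gateways))
        rolled = set(collapsed_edges(locations, level_of))
        for cuboid, edges in (("base", base), ("rolled", rolled)):
            for edge in edges:
                cell = cells[(cuboid, edge)]
                cell["gids"].add(gid)
                cell["count"] += n_items
    return cells


# Toy input (invented): two branches, one gateway, one roll-up level.
level_of = {"f1": "Factory", "d1": "Transport", "t1": "Transport"}
gateways = {"G1"}
cells = materialize(
    [(["f1", "d1", "t1", "G1"], 7, 3), (["f1", "d1", "G1"], 8, 2)],
    level_of,
    gateways,
)
```

Keeping the cuboid name in the cell key is one way to prevent cells at different abstraction levels from being mixed; a full cube would hold one such cuboid per combination of location, time, and item abstraction levels.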
Fig. 6. Fact table size versus path db size (S = 300).
Fig. 7. Map table size versus path db size (S = 300).
Fig. 8. Fact table size versus shipment size (N = 108,000).
Fig. 9. Map table size versus shipment size (N = 108,000).
this comparison is fair as both models materialize only direct edges. When we perform gateway materialization, the size of the model increases, but the increase is still linear in the size of the movement graph (much better than full materialization of edges between every pair of nodes, which is quadratic in the size of the movement graph), and close to the size of the model in [17].

Fig. 7 presents the size of the map table for the partitioned (part gw mat) and nonpartitioned (no part) models. The difference in size is almost a full order of magnitude. The reason is that our partition-level maps capture the semantics of collective object movements much better than [17]. This has very important implications for compression power and, more importantly, for query processing efficiency.

Fig. 8 presents the size of the fact table, as we vary the shipment size, under four different models: part no gw mat, part gw mat, no part, and path db. We see that compression improves as we increase shipment sizes. Although gateway materialization increases the size of the fact table, it is still much smaller than the original path database, except for very small shipment sizes, and it is also smaller than a nonpartitioned fact table.

Fig. 9 presents the size of the map table, as we vary shipment size, for the part gw mat and no part models. This experiment isolates the effect of shipment size on the compression of the map table. We can observe a very significant reduction in map size, even for small shipment sizes. This clearly indicates that partition-level map tables provide an advantage over a global map table. This is important not only in terms of space, but also in terms of query processing, as we will see in the next section.

10.3 Query Processing
An important contribution of our model is efficiency in query processing. In these experiments, we generate 100 random path queries that ask for a measure on the path segments for items that match an item dimension condition, go from a single initial location to a single ending location, and occur within a certain time interval. We compare the partitioned movement graph with gateway materialization (part gw mat) against the partitioned graph without gateway materialization (part no gw mat) and the nonpartitioned graph (no part). All the queries were answered on a movement graph at the same abstraction level as the original path database. We restrict the analysis to queries with starting and ending locations in different partitions of the graph. These queries are in general more challenging to answer. Based on our experiments on single-partition
Fig. 10. Query IO versus path db size (S = 300).
Fig. 11. Query IO versus shipment size (N = 108,000).
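To make the form of the path queries in Section 10.3 concrete, the following Python sketch evaluates one such query by a naive scan of a raw path database. It is a baseline for illustration only: it uses no cube, no partitions, and no gateways, so it does not reflect the query processing strategy evaluated in the experiments. The data layout (item id, item dimensions, list of (location, t_in, t_out) stages) and the names `find_segment`, `path_query`, and `avg_transit` are assumptions.

```python
from statistics import mean


def find_segment(stages, start, end):
    """Return the stages from the first visit to `start` up to the next visit to `end`."""
    locs = [loc for loc, _, _ in stages]
    if start not in locs:
        return None
    i = locs.index(start)
    if end not in locs[i + 1:]:
        return None
    j = locs.index(end, i + 1)
    return stages[i:j + 1]


def path_query(path_db, item_pred, start, end, t_lo, t_hi, measure):
    """Apply `measure` to the start-to-end segments of matching items.

    A segment qualifies when it begins no earlier than t_lo and ends no
    later than t_hi (assumed interpretation of "within a time interval").
    """
    segments = []
    for _item_id, dims, stages in path_db:
        if not item_pred(dims):
            continue
        seg = find_segment(stages, start, end)
        if seg is None:
            continue
        if t_lo <= seg[0][1] and seg[-1][2] <= t_hi:
            segments.append(seg)
    return measure(segments)


def avg_transit(segments):
    """Mean time between leaving the first location and reaching the last."""
    if not segments:
        return None
    return mean(seg[-1][1] - seg[0][2] for seg in segments)


# Toy path database (invented for illustration only).
path_db = [
    ("i1", {"product": "milk"}, [("factory", 0, 5), ("dock", 7, 9), ("store", 12, 20)]),
    ("i2", {"product": "milk"}, [("factory", 1, 4), ("truck", 6, 8), ("store", 10, 15)]),
    ("i3", {"product": "shoes"}, [("factory", 0, 5), ("dock", 7, 9), ("store", 12, 20)]),
    ("i4", {"product": "milk"}, [("factory", 30, 35), ("dock", 37, 39), ("store", 42, 50)]),
]


def is_milk(dims):
    return dims["product"] == "milk"


n_items = path_query(path_db, is_milk, "factory", "store", 0, 25, len)
transit = path_query(path_db, is_milk, "factory", "store", 0, 25, avg_transit)
```

Every query of this form touches the whole path database in the naive scan, which is the cost that precomputed cells and gid lists are meant to avoid.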
[35] D. Xin, J. Han, X. Li, and B.W. Wah, “Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration,” Proc. 2003 Int’l Conf. Very Large Data Bases (VLDB ’03), pp. 476-487, Sept. 2003.
[36] Y. Zhao, P.M. Deshpande, and J.F. Naughton, “An Array-Based Algorithm for Simultaneous Multidimensional Aggregates,” Proc. 1997 ACM Special Interest Group on Management of Data Int’l Conf. Management of Data (SIGMOD ’97), pp. 159-170, May 1997.

Hector Gonzalez received the PhD degree from the University of Illinois at Urbana-Champaign in 2008. Prior to the PhD degree, he completed the MBA degree from Harvard Business School in 1999. He is a research scientist working at Google Research. He conducts research on data mining, data warehousing, and information integration.

Jiawei Han is a professor in the Department of Computer Science at the University of Illinois. He has been working on research into data mining, data warehousing, stream data mining, spatiotemporal and multimedia data mining, biological data mining, social network analysis, text and Web mining, and software bug mining, with more than 400 conference and journal publications. He has chaired or served in over 100 program committees of international conferences and workshops and also served or is serving on the editorial boards for Data Mining and Knowledge Discovery, IEEE Transactions on Knowledge and Data Engineering, Journal of Computer Science and Technology, and Journal of Intelligent Information Systems. He is currently the founding editor-in-chief of ACM Transactions on Knowledge Discovery from Data (TKDD). He has received IBM Faculty Awards, the Outstanding Contribution Award at the International Conference on Data Mining (2002), ACM Service Award (1999), ACM SIGKDD Innovation Award (2004), and IEEE Computer Society Technical Achievement Award (2005). He is an ACM and IEEE fellow. His book Data Mining: Concepts and Techniques (Morgan Kaufmann) has been used worldwide as a textbook.

Hong Cheng received the PhD degree from University of Illinois at Urbana-Champaign in 2008. She is an assistant professor in the Department of Systems Engineering and Engineering Management at the Chinese University of Hong Kong. Her primary research interests include data mining, machine learning, and database systems. She has published more than 20 research papers in international conferences, journals, and book chapters, including SIGMOD, VLDB, SIGKDD, ICDE, ICDM, SDM, ACM Transactions on KDD, and Data Mining and Knowledge Discovery, and received research papers awards at ICDE ’07, SIGKDD ’06, and SIGKDD ’05.

Xiaolei Li received the BS, MS, and PhD degrees in computer science from the University of Illinois at Urbana-Champaign in May of 2002, 2004, and 2008, respectively. He is currently working for Microsoft AdCenter Labs researching a variety of topics related to advertising and other online services.

Diego Klabjan is an associate professor in the Department of Industrial Engineering and Management Sciences, Northwestern University. He received the doctorate degree from the School of Industrial and Systems Engineering of the Georgia Institute of Technology in 1999. He joined the University of Illinois at Urbana-Champaign in 1999. In 2007, he became an associate professor at Northwestern. He was the recipient of the first prize of the 2000 Transportation Science Dissertation Award and has received various other awards with graduate students. He is a former president of the Institute of Operations Research and the Management Sciences (INFORMS) Aviation Applications Section. He is an associate editor for Naval Research Logistics and two areas in Operations Research. His research is focused on transportation, supply chain management, radio frequency identification, and large-scale optimization.