
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 1, JANUARY 2010

Modeling Massive RFID Data Sets: A Gateway-Based Movement Graph Approach
Hector Gonzalez, Jiawei Han, Hong Cheng, Xiaolei Li, Diego Klabjan, and Tianyi Wu

Abstract—Massive Radio Frequency Identification (RFID) data sets are expected to become commonplace in supply chain
management systems. Warehousing and mining this data is an essential problem with great potential benefits for inventory
management, object tracking, and product procurement processes. Since RFID tags can be used to identify each individual item,
enormous amounts of location-tracking data are generated. With such data, object movements can be modeled by movement graphs,
where nodes correspond to locations and edges record the history of item transitions between locations. In this study, we develop a
movement graph model as a compact representation of RFID data sets. Since spatiotemporal as well as item information can be
associated with the objects in such a model, the movement graph can be huge, complex, and multidimensional in nature. We show that
such a graph can be better organized around gateway nodes, which serve as bridges connecting different regions of the movement
graph. A graph-based object movement cube can be constructed by merging and collapsing nodes and edges according to an
application-oriented topological structure. Moreover, we propose an efficient cubing algorithm that performs simultaneous aggregation
of both spatiotemporal and item dimensions on a partitioned movement graph, guided by such a topological structure.

Index Terms—RFID, data warehousing, data models.

• H. Gonzalez is with Google, Inc., 1600 Amphitheatre Parkway, Mountain View, CA 94043. E-mail: hagonzal@google.com.
• J. Han and T. Wu are with the University of Illinois at Urbana-Champaign, 201 N. Goodwin Avenue, Urbana, IL 61801. E-mail: hanj@cs.uiuc.edu, twu5@illinois.edu.
• H. Cheng is with The Chinese University of Hong Kong, Room 707, William M.W. Mong Engineering Building, Shatin, N.T., Hong Kong. E-mail: hcheng@se.cuhk.edu.hk.
• X. Li is with Microsoft, Microsoft AdCenter Labs 1, Microsoft Way, Redmond, WA 98052-6399. E-mail: xiaoleil@microsoft.com.
• D. Klabjan is with Northwestern University, 2145 Sheridan Road, Tech M239, Evanston, IL 60208-3119. E-mail: d-klabjan@northwestern.edu.

Manuscript received 12 July 2008; revised 27 Nov. 2008; accepted 5 Feb. 2009; published online 25 Feb. 2009.
Recommended for acceptance by S. Chakravarthy.
For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number TKDE-2008-07-0353.
Digital Object Identifier no. 10.1109/TKDE.2009.61.
1041-4347/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society.

1 INTRODUCTION

THE increasingly wide adoption of RFID technology by retailers to track containers, pallets, and even individual items as they move through the global supply chain, from factories in exporting countries, through transportation ports, and finally to stores in importing countries, creates enormous data sets containing rich multidimensional information on the movement patterns associated with objects and their characteristics. However, this information is usually hidden in terabytes of low-level RFID readings, making it difficult for data analysts to gain insight into the set of interesting patterns influencing the operation and efficiency of the procurement process. In order to realize the full benefits of detailed object tracking information, we need to develop a compact and efficient RFID cube model that provides OLAP-style operators useful to navigate through the movement data at different levels of abstraction of both spatiotemporal and item information dimensions. This is a challenging problem that cannot be efficiently solved by traditional data cube operators, as RFID data sets require the aggregation of high-dimensional graphs representing object movements, not just that of entries in a flat fact table.

We propose to model the RFID data warehouse using a movement graph-centric view, which makes the warehouse conceptually clear and better organized, and which obtains significantly deeper compression and performance gains over competing models in the processing of path queries. The importance of the movement graph approach to RFID data warehousing can be illustrated with an example.

Example. Consider a large retailer with a global supplier and distribution network that spans several countries and that tracks objects with RFID tags placed at the item level. Such a retailer sells millions of items per day through thousands of stores around the world, and for each such item, it records the complete set of movements between locations, starting at factories in producing countries, going through the transportation network, and finally arriving at a particular store where the item is purchased by a customer. The complete path traversed by each item can be quite long as readers are placed at very specific locations within factories, ships, and stores (e.g., a production lane, a particular truck, or an individual shelf inside a store). Further, for each object movement, properties such as shipping cost, temperature, and humidity can be recorded.

The questions become “how can we present a clean and well-organized picture about RFID objects and their movements?” and “can such a picture facilitate data compression, data cleaning, query processing, multilevel, multidimensional OLAPing, and data mining?”

Our movement graph approach provides a clean picture for modeling RFID objects at multiple levels of abstraction. It also facilitates data compression, data cleaning, and answering rather sophisticated queries, such as

• High-level aggregate/OLAP query: What is the average shipping cost of transporting electronic goods from factories in Shanghai to stores in San Francisco in 2007? And then, click to drill down to month and see the trend.
• Path query: Print the transportation paths for beef products from Argentina sold in California on 5 April that were exposed to over 40 degrees centigrade heat for over 5 hours on the route.

Fig. 1. An example movement graph.

We propose a movement graph-based model, which leads to concise and clean modeling of massive RFID data sets and facilitates RFID data compression, query answering, cubing, and data mining. The movement graph is a graph that contains a node for every distinct (or more exactly, interesting) location in the system, and edges between locations record the history of shipments (groups of items that travel together) between locations. For each shipment we record a set of interesting measures such as travel time, transportation cost, or sensor readings such as temperature or humidity. We show that this graph can be partitioned and materialized according to its topology to speed up a large number of queries, and that it can be aggregated into cuboids at different abstraction levels according to location, time, and item dimensions, to provide multidimensional and multilevel summaries of item movements.

Fig. 1 summarizes our proposed data warehousing architecture. We receive as input a sequence of RFID readings and store them in a database, recording the path traversed by each item. Based on the structure of paths, we construct a movement graph, which is partitioned along gateway nodes. Each partition is then cubed independently. A query processing module then answers OLAP queries over the aggregated movement graph.

The technical contributions can be summarized as follows:

1. Gateway-based partitioning of the movement graph. We make the key observation that the movement graph can be divided into disjoint partitions that are connected through special gateway nodes. Most paths with locations in more than one partition include the gateway nodes. For example, most items travel from China to the United States by going through major shipping ports in both countries. Gateways can be given by a user or be discovered by analyzing traffic patterns. Further, materialization can be performed for indirect edges between individual locations and their corresponding gateways and between gateways. Such materialization facilitates computing measures on the paths connecting locations in different partitions. An efficient graph partitioning algorithm is developed that uses the gateways to split the graph into clusters in a single scan of the path database.
2. Redundancy elimination compression. RFID readers provide tuples of the form (EPC, location, time) at fixed time intervals. When an item stays at the same location for a period of time, multiple tuples will be generated. We can group these tuples into a single one of the form (EPC, location, time_in, time_out). This form of compression is lossless, and it can significantly reduce the size of the raw RFID readings.
3. Partition-based bulky movement compression. Items tend to move and stay together through different locations. For example, a pallet with 500 cases of CDs may arrive at the warehouse; from there, cases of 50 CDs may move to the shelf; and from there, packs of 5 CDs may move to the checkout counter. We can register a single stay or transition record for thousands of items that stay and move together. Such a record would point to a gid, which is a generalized identifier pointing to the subgroups that it contains. In global supply chain applications, one may observe a “merge-split” process, e.g., shipments grow in size as they approach the major ports, and then after a long distance bulky shipping, they gradually split (or even recombine) when approaching stores. We propose a partitioned map table that creates a separate mapping for each partition, rooted at major shipping ports.
4. Movement graph aggregation. The movement graph can be aggregated to different levels of abstraction according to the location concept hierarchy that determines which subset of locations is interesting for analysis, and at which level. For example, the movement graph may only have locations inside Massachusetts, and it may aggregate every individual location inside factories, warehouses, and stores to a single node. This aggregation mechanism is very different from the one present in traditional data cubes, as nodes and edges in the graph can be merged or collapsed, and the shipments along edges and their measures need to be recomputed using different semantics for the cases of node merging and node collapsing.

2 RELATED WORK

RFID technology has been researched from several perspectives:

1. the physics of building tags and readers [11], [29],
2. the techniques required to guarantee privacy and safety [6], [14], [30],
3. the software architecture required to collect, filter, organize, and answer online queries on tags known as the “EPC Global Network,” which is defined by several standards including [2], [10], [26], [28], [32],

4. cleaning methods to filter noisy RFID observations [18], [22], [23], [27],
5. event processing systems on RFID data streams [3], [24], [25], [34], and
6. storage, warehousing, and mining of massive RFID data sets [8], [15], [16], [17].

EPC Global Inc. is a standards body that has defined specifications for many of the components of an RFID system. It starts by describing data contained in the tag [32], the communication protocols between reader and tags [33], [21], and the transmission of raw RFID readings by readers [28]. At a higher level, it specifies how raw readings are converted into events [2], and how such events are stored [10]. Finally, lookup of information about a particular EPC is defined by the object name service [26].

Cleaning of RFID data has been addressed both from the design of robust communication protocols that operate at the hardware level [21], [33], and from postprocessing software designed to detect incorrect readings. At the software level, the use of smoothing windows has been proposed as a method to reduce false negatives. Floerkemeier and Lampe [12] propose fixed-window smoothing. Shawn et al. [22] propose a variable-window smoothing that adapts window size dynamically according to tag detection rates, by using a binomial model of tag readings. Jeffrey et al. [23] present a framework to clean RFID data streams by applying a series of filters. Gonzalez et al. [18] present a cost-conscious cleaning framework that learns how to apply multiple techniques to increase accuracy and reduce costs.

Event processing systems are another area of research. RFID systems generate a stream of raw events, which themselves can form higher level events. Work in this area has focused on efficient processing of very large data streams [24], [25], [34], definition of complex events, extensions to standard query languages to account for unique temporal characteristics of RFID events [3], [34], and coping with noise in the RFID stream by defining probabilistic events over low-level events [24].

Closer to this work, there is significant research on storage, warehousing, and mining of massive RFID data sets. Gonzalez et al. [17] introduced the problem of compressing and warehousing massive RFID data sets. They proposed the concept of the RFID-cuboid, which compresses and summarizes an RFID data set by recording information on items that stay together at a location with stay records. This model takes advantage of bulky shipments that are successively split into smaller shipments to compress data. This paper is an extension of [17] to account for a more realistic item movement model, where items not only split, but can merge and split multiple times as they move through a global supply chain. Lee and Chung [25] propose a very clever path encoding technique that improves on the encoding used in [17] to speed up query processing. At the mining end, Gonzalez et al. [15], [16] propose the discovery of workflows from RFID data. Such workflows summarize major flow trends and significant flow exceptions at multiple abstraction levels.

Our work on RFID warehousing makes use of several traditional data warehousing techniques. An RFID data cube shares many common principles with the traditional data cube [1], [7], [19]. They both aggregate data at different levels of abstraction in multidimensional space, but the latter is not able to handle aggregation of complex trajectory data. The problem of RFID-cuboid materialization is analogous to the problem of partial data cube materialization studied in [20], [31]. Efficient computation of data cubes has been a major concern for the research community, and many of the techniques [4], [35], [36] can be applied to speed up construction of an RFID data cube.

3 RFID DATA

In this section, we give a brief introduction to data generation in RFID applications and explain the difficulties of applying traditional data warehousing models to such data.

3.1 Data Generation

An RFID object tracking system is composed of a collection of RFID readers scanning for tags at periodic intervals. Each such reader is associated with a location, and generates a stream of time-ordered tuples of the form (EPC, location, time), where EPC¹ is a unique 96-bit electronic product code associated with a particular item, location is the place where the item was detected, and time is the time when the detection took place. Significant data compression is possible by merging all the readings for an item that stays at a location for a period of time into a tuple of the form (EPC, location, time_in, time_out), where time_in is the time when the item identified by EPC entered location and time_out is the time when it left.

1. We use EPC and tag interchangeably in the paper.

By sorting tag readings on EPC we can generate a path database, where we store the sequence of locations traversed by each item. Entries in the path database are of the form (EPC, (l_1, time_in_1, time_out_1)(l_2, time_in_2, time_out_2) ... (l_k, time_in_k, time_out_k)). Table 1 presents an example path database for six items identified with tags t1 to t6, traveling through locations A, B, C, D, G1, G2, F, I, and J.

TABLE 1. An Example Path Database

In addition to the location information collected by RFID readers, an RFID application has detailed information on the characteristics of each item in the system. Such information can be represented with tuples of the form (EPC, d_1, d_2, ..., d_m), where each d_i is a particular value for dimension D_i; typical dimensions could be product type, manufacturer, weight, or price. In many cases we can also have extra information describing measurements or properties collected during item shipments. This data has the form (from, to, t_1, t_2, tag_list : measure_list), where from and to

are the initial and final locations of the shipment, t_1 and t_2 are the starting and ending time of the shipment, tag_list is the set of items transported, and measure_list describes properties such as temperature, humidity, or shipping cost.
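
The two steps described in this section, merging repeated readings into stay records and then grouping stays by EPC into a path database, can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names compress_readings and build_path_database, the integer timestamps, and the sample readings are all hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def compress_readings(readings):
    """Merge raw (epc, location, time) readings into stay records
    (epc, location, time_in, time_out)."""
    stays = []
    # Sort on EPC, then time, so each item's readings are contiguous and ordered.
    for epc, loc, t in sorted(readings, key=itemgetter(0, 2)):
        last = stays[-1] if stays else None
        if last and last[0] == epc and last[1] == loc:
            # Same item still at the same location: extend the current stay.
            stays[-1] = (epc, loc, last[2], t)
        else:
            # First reading of this item at a new location: open a new stay.
            stays.append((epc, loc, t, t))
    return stays


def build_path_database(stays):
    """Group stay records by EPC into {epc: [(location, time_in, time_out), ...]}.
    Assumes stays are already ordered by EPC, as compress_readings returns them."""
    return {
        epc: [(loc, t_in, t_out) for _, loc, t_in, t_out in group]
        for epc, group in groupby(stays, key=itemgetter(0))
    }


# Hypothetical readings for two tags (sample data, not Table 1 of the paper).
readings = [
    ("t1", "A", 1), ("t2", "A", 1), ("t1", "A", 2), ("t2", "A", 2),
    ("t1", "D", 5), ("t1", "D", 6), ("t2", "B", 4),
]
stays = compress_readings(readings)
paths = build_path_database(stays)
```

Only consecutive readings at the same location are merged, so an item that leaves a location and later returns to it produces two separate stays, which keeps the compression lossless with respect to the sequence of locations visited.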

3.2 Data Cubing Challenges

The path nature of RFID data makes it hard to incorporate into a traditional data cube while preserving its structure. Suppose we view the cleansed RFID data as the fact table with dimensions (EPC, location, time_in, time_out : measure). The data cube will compute all possible group-bys on this fact table by aggregating records that share the same values (or any value, represented by the symbol *) at all possible combinations of dimensions. If we use count as measure, we can get, for example, the number of items that stayed at a given location for a given month. The problem with this form of aggregation is that it does not consider links between the records. For example, if we want to get the number of items of type “dairy product” that traveled from the distribution center in Chicago to stores in Urbana, we cannot get this information. We have the count of “dairy products” for each location but we do not know how many of these items went from the first location to the second. The problem could be solved by increasing the number of dimensions: we could use separate dimensions for distinct path segments, and cells would then contain multilocation aggregates. The problem with this approach is that as we increase the number of dimensions, the size of the data cube grows exponentially. We need a model capable of aggregating RFID data concisely while preserving its path-like structure for OLAP analysis.

4 GATEWAY-BASED MOVEMENT GRAPH

Among many possible models for RFID data warehouses, we believe the gateway movement graph model not only provides a concise and clear view over the movement data, but also facilitates data compression, querying, and analysis of massive RFID data sets (which will be clear later).

Definition 4.1. A movement graph G(V, E) is a directed graph representing object movements; V is the set of locations, E is the set of transitions between locations. An edge e(i, j) indicates that objects moved from location v_i to location v_j. Each edge is annotated with the history of object movements along the edge; each entry in the history is a tuple of the form (t_start, t_end, tag_list : measure_list), where all the objects in tag_list took the transition together, starting at time t_start and ending at time t_end, and measure_list records properties of the shipment.

Fig. 1 presents the movement graph for the path database in Table 1. We have a single node for each location, and there is an edge between locations if there is a direct transition between them in the path database, e.g., in paths 1, 2, and 3, items move directly between A and D, and thus, the edge e(A, D). We will explain later in the section the meaning of the dotted circles dividing the nodes in the movement graph.

In a global supply chain, it is possible to identify important gateway locations that serve to connect remote sets of locations in the transportation network. Gateways generally aggregate relatively small shipments from many distinct, regional locations into large shipments destined for a few well-known remote locations; or they distribute large shipments from remote locations into smaller shipments destined to local regional locations. These special nodes are usually associated with shipping ports, e.g., the port in Shanghai aggregates traffic from multiple factories and makes large shipments to ports in the United States, which, in turn, split the shipments into smaller units destined for individual stores. The concept of gateways is important because it allows us to naturally partition the movement graph to improve query processing efficiency and reduce the cost of cube computation.

Fig. 2. Three types of gateways.

Gateways can be categorized into three classes: Out-Gateways, In-Gateways, and In-Out-Gateways, as described below.

Out-Gateways. In the supply chain, it is common to observe locations, such as ports, that receive relatively low-volume shipments from a multitude of locations and send large-volume shipments to a few remote locations. For example, a port in Shanghai may receive products from a multitude of factories and logistics centers throughout China to later send the products by ship to a port in San Francisco. We call this type of node an Out-Gateway, and it is characterized by 1) low ratio of average incoming shipment size to average outgoing shipment size, 2) high ratio of the number of incoming to outgoing edges, and 3) high centrality, in the sense that most shipments can only reach remote locations in the movement graph by going through an Out-Gateway. Fig. 2a presents an Out-Gateway. For our running example, Fig. 1, location G1 is an Out-Gateway.

In-Gateways. In-Gateways are the symmetric complement of Out-Gateways; they are characterized by 1) high ratio of average incoming shipment size to average outgoing shipment size, 2) low ratio of the number of incoming to outgoing edges, and 3) high centrality, in the sense that most shipments can only enter remote locations in the movement graph by going through an In-Gateway. An example of an In-Gateway may be a sea port in New York where a large volume of imported goods arrives in the United States and is redirected to a multitude of distribution centers throughout the country before reaching individual stores. Fig. 2b presents an example In-Gateway. For our running example, Fig. 1, location G2 is an In-Gateway.

In-Out-Gateways. In-Out-Gateways are the locations that serve as both In-Gateways and Out-Gateways. This is the case of many ports that may, for example, serve as an In-Gateway for raw materials being imported and an Out-Gateway for manufactured goods being exported. Fig. 2c presents such an example. It is possible to split an In-Out-Gateway into

separate In- and Out-Gateways by matching incoming and outgoing edges carrying the same subset of items into the corresponding single direction traffic gateways.

Notice that gateways may naturally form hierarchies. For example, one may see a hierarchy of gateways, e.g., country level sea ports → region level distribution centers → state level hubs.

5 DATA COMPRESSION

5.1 Redundancy Elimination Compression

RFID data contains large amounts of redundancy. Each reader scans for items at periodic intervals, and thus generates hundreds or even thousands of duplicate readings for items in its range that are not moving. For example, if a pallet stays at a warehouse for 7 days, and the reader scans for items every 30 seconds, there will be 20,160 readings of the form (EPC, warehouse, time). We could compress all these readings, without loss of information, to a single tuple of the form (EPC, warehouse, time_in, time_out), where time_in is the first time that the EPC was detected in the warehouse and time_out the last one.

Redundancy elimination can be accomplished by sorting the raw data on EPC and time, and generating time_in and time_out for each location by merging consecutive records for the same object staying at the same location.

5.2 Bulky Movement Compression

Since a large number of items travel and stay together through several stages, it is important to represent such a collective movement by a single record no matter how many items were originally collected. As an example, if 1,000 boxes of milk stayed in location loc_A between time t_1 (time_in) and t_2 (time_out), it would be advantageous if only one record is registered in the database rather than 1,000 individual RFID records. The record would have the form (gid, prod, loc_A, t_1, t_2, 1000), where 1000 is the count, prod is the product id, and gid is a generalized id which will not point to the 1,000 original EPCs but instead point to the set of new gids which the current set of objects move to. For example, if this current set of objects were split into 10 partitions, each moving to one distinct location, gid will point to 10 distinct new gids, each representing a record. The process iterates until the end of the object movement where the concrete EPCs will be registered. By doing so, no information is lost but the number of records to store such information is substantially reduced.

The process of selecting the most efficient grouping for items, both in terms of compression and query processing, depends on the movement graph topology.

5.2.1 Split-Only Model

In some applications, the movement graph presents a tree-like structure, with a few factories near the root, warehouses and distribution centers in the middle, and a large number of individual stores at the leaves. In such a topology, it is common to observe items moving in large groups near the factories and splitting into smaller groups as they approach individual stores. We say that movement graphs with this topology present a Split-Only model of object movements.

In the Split-Only model, we can gain significant compression by creating a hierarchy of gids, rooted at factories where items move in the largest possible groups, and pointing to successively smaller groups as items move down the supply chain. In this model, a single grouping schema provides good compression because the basic groups in which objects move are preserved throughout the different locations, i.e., the smallest groups that reach the stores are never shuffled, but are preserved all the way from the factory. In the next section, we will present a more general model that can accommodate both split and merging of groups.

5.2.2 Merge-Split Model

A more complex model of object movements is observed in a global supply chain operation, where items may merge and split, and groups of items can be shuffled several times. One such case is when items move between exporting and importing countries. At the exporting country, items merge into successively larger groups on their way from factories to logistic centers, and finally to large shipping ports. In the importing country, the process is usually reversed: items split into successively smaller groups as they move from the incoming port, to distribution centers, and all the way to individual stores. We say that movement graphs with this topology present a Merge-Split model of object movements.

A single object grouping model, such as the one used in a Split-Only model, would not be optimal when groups of items can both split and merge. A better option is to partition the movement graph around gateways, and define an item grouping model at the partition level. For example, the exporting country would get a hierarchy of groups rooted at the port and ending at the factories, while the importing country will have a separate hierarchy rooted at the port and ending at the individual stores. Using a single grouping for both partitions has the problem that each group would have to point to many small subgroups, or even just individual items that are preserved throughout the entire supply chain, after multiple operations of merge, split, and shuffle. Separate groupings prevent this problem by requiring bulky movement only at the partition level, and allowing for merge, split, and even shuffling of items without loss of compression.

5.3 Data Generalization

Since many users are only interested in data at a relatively high abstraction level, data compression can be explored to group, merge, and compress data records. This type of compression, as opposed to the previous two compression methods, is lossy, because once we aggregate the data at a high level of abstraction, e.g., time aggregated from second to hour, we cannot ask queries for any level below the aggregated one.

There are two types of data generalization: item-based, which is the same encountered in traditional data cubes and does not involve spatiotemporal dimensions; and path-based, which is unique to RFID data sets.

Path-Level Generalization. A new type of data generalization, not present in traditional data cubes, is that of merging and collapsing path stages according to time and location concept hierarchies. For example, if the minimal

granularity of time is hour, then objects moving within the same hour can be seen as moving together and be merged into one movement. Similarly, if the granularity of the location is shelf, objects moving to the different layers of a shelf can be seen as moving to the same shelf and be merged into one.

Another type of path generalization is that of expanding different types of locations to different levels of abstraction, depending on the analysis task. For example, a transportation manager may want to collapse all movements inside stores and warehouses while expanding movements within trucks and transportation centers to a very detailed level. On the other hand, store managers may want to collapse all object movements outside their particular stores.

An important difference between path level generalization and the more conventional data cube generalization along concept hierarchies is that in path level aggregation, we need to preserve the path structure of the data, i.e., we need to make sure that the new times, locations, and transitions are consistent with the original data.

6 MOVEMENT GRAPH PARTITIONING

In this section, we discuss the methods for identifying gateways, partitioning based on the movement graph, and associating partitions to gateways.

6.1 Gateway Identification

In many applications, it is possible for data analysts to provide the system with the complete list of gateways; this is realistic in a typical supply chain application where the set of transportation ports is well known in advance, e.g., Walmart knows all the major ports connecting its suppliers in Asia to the entry ports in the United States. In some other cases, we need to discover gateways automatically. We can use existing graph partitioning techniques such as balanced minimum cut or average minimum cut [9] to find a small set of edges that can be removed from the graph so that the graph is split into two disconnected components; such edges will typically be associated with the strong traffic edges of In- or Out-Gateways. Gateways could also be identified by using the concept of betweenness and centrality in social network analysis as they will correspond to nodes with high

[...] shipment sizes. Finally, for the edges that pass the above two filters, check which locations split the paths going through the location into two largely disjoint sets; that is, the locations in paths involving the gateway can be split into two subsets, locations occurring in the path before the gateway and those occurring in the path after the gateway.

6.2 Partitioning Algorithm

The movement graph partitioning problem can be framed as a traditional graph clustering problem and we could use techniques such as spectral clustering [9], [20]. But for the specific problem of partitioning supply chain movement graphs, we can design a less costly algorithm that takes advantage of the topology of the graph to associate locations to those gateways to which they are more strongly connected.

The key idea behind the partitioning algorithm is that in the movement graph for a typical supply chain application, locations only connect directly (without going through another gateway) to a few gateway nodes. That is, very few items in Europe reach the major ports in the United States without first having gone through Europe’s main shipping ports. Using this idea, we can associate each location to the set of gateways that it directly reaches (we use a frequency threshold to filter out gateways that are reached only rarely). When two locations l_i and l_j have a gateway in common, we merge their groups into a single partition containing the two locations and all their associated gateways. We repeat this process until no additional merge is possible. At the end, we do a postprocessing step where we associate each very small partition to the larger partition to which it most frequently directly connects.

Analysis. Algorithm 1 presents the details of movement graph partitioning given a set of gateways. In a single scan of the path database, we compute statistics on the traffic from each node to the different gateways. We then go through the list of locations, merging sets of locations that share common gateways. Finally, we merge small clusters into larger ones. This algorithm scales linearly with the size of the path database, linearly with the number of nodes in the movement graph, and quadratically with the number of gateways in the movement graph. We can further speed up the algorithm by running it on a random sample of the original database instead of running it on the full data. This
betweenness as defined in [13] and we can use an efficient is possible because the structure of the supply chain is
algorithm such as [5] to find them. usually fairly stable over time, and a representative random
Here, we propose a simple but effective approach to sample is enough to capture the topology of the graph.
discover gateways that works well for typical supply
Algorithm 1 Movement graph partitioning
chain operations where gateways have strong character-
Input: GðV ; EÞ: a movement graph, W  V : the set of
istics that are easy to identify. We can take a movement
gateways, D: a path database, min nodes: min. # of vertices
graph, and rank nodes as potential gateways based on the
following observations: 1) a large volume of traffic goes per partition, min connectivity: min. # of paths to gateway.
T
through gateway nodes, 2) gateways carry unbalanced Output: A partition of V into V1 ; . . . ; Vk s.t. Vi Vj ¼
traffic, i.e., incoming and outgoing edges carrying the same ; 8i 6¼ j.
tags but having very different average shipment sizes, and Method
3) gateways split paths into largely disjoint sets of nodes 1: Let C be a connection matrix, with entries C½li ; gj  that
that only communicate through the gateway. The algo- indicate the number of times that location li connects to
rithm can find gateways by eliminating first low-traffic gateway gj . Initialize every entry in C to 0.
nodes and then the nodes with balanced traffic, i.e., 2: for each path p in D’ do
checking the number of incoming and outgoing edges, and 3: for each location l in p do
the ratio of the average incoming (outgoing) shipment 4: C½l; gþ ¼ 1 where g is the next gateway after l in
sizes versus the average of the outgoing (incoming) p, or g is the previous gateway before l in p.
5:   end for
6: end for
7: for each location l ∈ V do
8:   L_g = all gateways g s.t. C[l, g] > min_connectivity
9:   add l to the partition containing all gateways in L_g; if the gateways in L_g reside in different partitions, merge the partitions and add l to the merged partition; if no partition exists that contains any element in L_g, create a new partition containing L_g ∪ {l}.
10: end for
11: for all partitions V_i s.t. |V_i| < min_nodes do
12:   merge V_i with the partition V_j s.t. the traffic from/to gateways in V_i to/from gateways in V_j is maximum; this can be determined using the connection matrix C.
13: end for
14: return partitions V_1, ..., V_k

6.3 Handling Sporadic Movements: Virtual Gateways

An important property of gateways is that all the traffic leaving or entering a partition goes through them. However, in reality it is still possible to have small sets of sporadic item movements between partitions that bypass gateways. Such movements reduce the effectiveness of gateway-based materialization because path queries involving multiple partitions will need to examine some path segments of the graph unrelated to gateways. This problem can be easily solved by adding a special virtual gateway to each partition for all outgoing and incoming traffic from and to other partitions that does not go through a gateway. Virtual gateways guarantee that intergateway path queries can be resolved by looking at gateway-related traffic only.

For our running example, Fig. 1, we can partition the movement graph along the dotted circles, and associate gateway G_1 with the first partition and gateway G_2 with the second one. In this case, we need to create a virtual gateway G_x to send outgoing traffic from the first partition (i.e., traffic from B) that skips G_1, and another virtual gateway G_y to receive incoming traffic into the second partition (i.e., traffic to I) that skips G_2.

7 STORAGE MODEL

With the tremendous amounts of RFID data, it is crucial to study the storage model. We propose to use three data structures to store both compressed and generalized data: 1) an edge table, storing the list of edges, or alternatively a stay table, storing the list of nodes, 2) a map table, linking groups of items moving together, and 3) an information table, registering path-independent information related to the items in the graph.

7.1 Edge Table

This table registers information on the edges of the movement graph; the format is ⟨from, to, t_start, t_end, direct, gid_list : measure_list⟩, where from is the originating node, to is the destination node, t_start is the time when the items departed the location from, t_end is the time when the items arrived at the location to, direct is a boolean value that is true if the items moved directly between from and to and false if intermediate nodes were visited, gid_list is the list of items that traveled together, and measure_list is a set of aggregate functions computed on the items in gid_list while they took the transition, e.g., it can be count, average temperature, average movement cost, etc. We will elaborate more on the concept of gids in the section describing the map table.

An alternative to recording edges in the graph is to record nodes and the history of items staying at each node. The particular representation should match the application needs: if most queries ask about properties of items at a given location, materializing nodes may be better, and if most queries ask about properties during transition, materializing edges is better. If appropriate, we can materialize both.

7.2 Map Table

Bulky movement means a large number of items move together. A generalized identifier gid can be assigned to every group of items that moves together, which will substantially reduce the size of the tag lists at each edge. When groups of items split into smaller groups, gid (original group) can be split into a set of children gids, representing these smaller groups. The map table contains entries of the form ⟨partition, gid, contained_list, contains_list⟩, where partition is the subgraph of the movement graph where this map is applicable, contained_list is the list of all gids with a list of items that is a superset of the items in gid, and contains_list is the list of gids with item lists that are a subset of gid, or a list of individual tags if gid did not split into smaller groups.

There are two main reasons for using a map table instead of recording the complete EPC lists at each stage: 1) data compression and 2) query processing efficiency.

Compression: First, we do not want to record each RFID tag on the EPC list for every stay record it participated in. For example, if we assume that 10,000 items move in the system in groups of 10,000, 1,000, 100, and 10 through four stages, instead of using 40,000 units of storage for the EPCs in the stay records, we use only 1,111 units (see footnote 2): 1,000 for the last stage and 100, 10, and 1 for the ones before.

Since real RFID data sets involve both merge and split movement models, the Split-Only model cannot have much sharing and compression. Here, we adopt a Merge-Split model, where objects can be merged, shuffled, and split in many different combinations during transportation. Our mapping table takes a gateway-centered model, where mapping is centered around gateways, i.e., the largest merged and collective moving sets at the gateways become the root gids, and their children gids can be spread in both directions along the gateways. This will lead to the maximal gid sharing and gid_list compression.

Query Processing: The second and the more important reason for having such a map table is the efficiency in query processing. By creating gid lists that are much shorter than EPC lists, we can compute path-related queries very quickly. To compute, for example, the average duration for milk to move from the distribution center (D), to the

2. This figure does not include the size of the map itself, which should use 12,221 units of storage, still much smaller than the full EPC lists.
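The storage figures quoted in the compression example and in footnote 2 can be reproduced with a short calculation. The sketch below is illustrative only: the function name `storage_units` is hypothetical, and the way the map size is counted (one unit per gid plus one unit per link from a gid to each of its child gids or tags) is an assumption chosen because it reproduces the 12,221 figure, not a definition given in the text.

```python
def storage_units(total_items, group_sizes):
    """Return (full EPC units, gid units, map units) for items that move
    through len(group_sizes) stages in groups of the given sizes."""
    # Number of groups (and hence gids) present at each stage.
    groups_per_stage = [total_items // size for size in group_sizes]

    # Recording every EPC in every stay record.
    full_epc_units = total_items * len(group_sizes)

    # Recording one gid per group at each stage instead.
    gid_units = sum(groups_per_stage)

    # Assumed map cost: each gid links to its children at the next stage,
    # and the gids of the last stage link to the individual tags.
    child_links = sum(groups_per_stage[1:]) + total_items
    map_units = gid_units + child_links

    return full_epc_units, gid_units, map_units


full_epc, gids, map_size = storage_units(10_000, [10_000, 1_000, 100, 10])
print(full_epc, gids, map_size)  # 40000 1111 12221
```

Under these assumptions, the gid lists plus the map together take 13,332 units, about a third of the 40,000 units needed for full EPC lists.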
store backroom (B), and finally to the shelf (S), we need to locate the edge records for milk between the stages and intersect the EPC lists of each. By using the map, the EPC lists can be orders of magnitude shorter, and thus, reduce IO costs.

7.3 Information Table

The information table records other attributes that describe properties of the items traveling through the edges of the movement graph. The format of the tuples in the information table is ⟨gid_list, D_1, ..., D_n⟩, where gid_list is the list of items that share the same values on the dimensions D_1 to D_n, and each dimension D_i describes a property of the items in gid_list. An example of attributes that may appear in the information table could be product, manufacturer, or weight. Each dimension of the information table can have an associated concept hierarchy, e.g., the product dimension may have a hierarchy such as EPC → SKU → product → category.

8 MATERIALIZATION STRATEGY

Materialization of path segments in the movement graph may speed up a large number of path-related queries. Since there is an exponential number of possible path segments that can be precomputed in a large movement graph, it is only realistic to materialize those path segments that provide the highest expected benefit at a reasonable cost. We will develop such a strategy here.

8.1 Path Queries

A path query requires the computation of a measure over all the items with a path that matches a given path pattern. It is of the form: q ← ⟨c_info, path_expression, measure⟩, where c_info is a selection on the information table that retrieves the relevant items for analysis; path_expression is a sequence of stage conditions on location and time that should appear in every path, in the given order but possibly with gaps; and measure is a function to be computed on the matching paths. An example path query may be c_info = {product = beef, sale_date = 2006}, path_expression = {Argentina farm A, San Mateo store S}, and measure = average temperature, which asks for the average temperature recorded for each beef package traveling from a certain farm in Argentina to a particular store in San Mateo.

There may be many strategies to answer a path query, but, in general, we will need to retrieve the appropriate tag lists and measures for the edges along the paths involving the locations in the path expression; retrieve the tag list for the items matching the info selection; intersect the lists to get the set of relevant tags; and finally, if needed, retrieve the relevant paths to compute the measure.

8.2 Path Segment Materialization

We can model path segments as indirect edges in the movement graph. For example, if we want to precompute the list of items moving from location l_i to location l_j through any possible path, we can materialize an edge from l_i to l_j that records a history of all tag movements between the nodes, including movements that involve an arbitrary number of intermediate locations. Indirect edges are stored in the same format as direct ones, but with the flag direct set to false.

The benefit of materializing a given indirect edge in the movement graph is proportional to the number of queries for which this edge reduces the total processing cost. Indirect edges involved in a path query reduce the number of edges that need to be analyzed and provide shorter tag lists that are faster to retrieve and intersect. In order for an indirect edge to help a large number of queries, it should have three properties: 1) carry a large volume of traffic, 2) be part of a large portion of all the paths going from nodes in one partition of the graph to nodes in any other partition, and 3) be involved directly or indirectly in a large number of path queries. The sets of edges that best match these characteristics are the following.

8.2.1 Node-to-Gateway

In supply chain implementations, it is common to find a few well-defined Out-gateways that carry most of the traffic leaving a partition of the graph where items are produced, before reaching a partition of the graph where items are consumed. For example, products manufactured in China destined for export to the United States leave the country through a set of ports. We propose to materialize the (virtual) edges from every node to the Out-gateways that it first reaches. Such materialization, for example, would allow us to quickly determine the properties of shipments originating at any location inside China and leaving the country.

8.2.2 Gateway-to-Node

Another set of important nodes for indirect edge materialization are In-gateways, as most of the item traffic entering a region of the graph where items are consumed has to go through an In-gateway. For example, imported products coming into the United States all arrive through a set of major ports. When we need to determine which items sold in the United States have paths that involve locations in foreign countries, we can easily get this information by precomputing the list of items that arrived at the location from each of the In-gateways. We propose to materialize all the (virtual) edges from an In-gateway to the nodes that it reaches without passing through any other gateway.

8.2.3 Gateway-to-Gateway

Another interesting set of indirect edges to materialize are the ones carrying inter-gateway traffic. For example, we want to precompute which items leaving the Shanghai port finally arrive at the New York port. The benefit of such an indirect edge is twofold: First, it aggregates a large number of possible paths between two gateways and precomputes important measures on the shipments; and second, it allows us to quickly determine which items travel between partitions.

Lemma 8.1. A movement graph with $k$ partitions, $p_1, \ldots, p_k$, each partition $i$ with $p_i^n$ nodes and $p_i^g$ gateways, will require the materialization of a number of indirect edges that is bounded by $\sum_{i=1}^{k} (p_i^n \times p_i^g) + \sum_{i \neq j} p_i^g \times p_j^g$.

Proof. For each node in each partition we will materialize at most $p_i^g$ indirect edges; this is when the node has traffic to or from every gateway. The maximum number of node
Fig. 4. Graph aggregation.


Fig. 3. Location concept hierarchy.
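The aggregation that Fig. 3 and Fig. 4 depict, and that Section 9.1 describes, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper `collapse_path` and the dictionary encoding of the location hierarchy are hypothetical, while the example path and the generalized locations Transportation and Store are the ones used in the Fig. 4 discussion.

```python
def collapse_path(path, generalize):
    """Aggregate a path of locations along a concept hierarchy.

    `generalize` maps a location to its generalized location, or to None
    when the location is to be removed from the analysis. Locations that
    are not in the mapping stay at their current level.
    """
    collapsed = []
    for location in path:
        target = generalize.get(location, location)
        if target is None:
            continue  # location removed from the movement graph
        # Consecutive stays at the same generalized location are merged.
        if not collapsed or collapsed[-1] != target:
            collapsed.append(target)
    return collapsed


path = ["Factory", "Dock", "Hub", "Backroom", "Shelf"]

# Collapse transportation and store locations to one node each.
to_groups = {"Dock": "Transportation", "Hub": "Transportation",
             "Backroom": "Store", "Shelf": "Store"}
print(collapse_path(path, to_groups))
# ['Factory', 'Transportation', 'Store']

# Remove transportation locations completely.
drop_transport = {**to_groups, "Dock": None, "Hub": None}
print(collapse_path(path, drop_transport))
# ['Factory', 'Store']
```

The sketch only rewrites the sequence of locations; a full aggregation would also have to recompute times and merge the gid lists and measures of the collapsed edges.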

to gateway edges is then $\sum_{i=1}^{k} p_i^n \times p_i^g$; the maximum number of gateway to gateway edges occurs when every gateway has traffic to every other gateway, for a maximum number of gateway to gateway edges that is $\sum_{i \neq j} p_i^g \times p_j^g$. □

The implication of Lemma 8.1 is that our materialization scheme is small in size, especially when compared to a full materialization model in which we compute direct edges between every pair of nodes, in which case we would require $(\sum_{i=1}^{k} p_i^n)^2$ additional edges. In practice, the overhead of gateway materialization is usually smaller than the bound found in Lemma 8.1; the reason is that nodes in a partition will tend to associate with a single gateway and partitions will connect to a small subset of other partitions.

Lemma 8.2. Given a movement graph with $k$ partitions, $p_1, \ldots, p_k$, each partition $i$ with $p_i^n$ nodes and $p_i^g$ gateways, any pairwise path query $q$ involving a path expression with two nodes $(n_i, n_j)$, where $n_i \in p_i$, $n_j \in p_j$, can be answered by analyzing at most $p_i^g + p_j^g + p_i^g \times p_j^g$ edges.

Proof. In the worst case, traffic can travel from node $n_i$ to node $n_j$ through all the gateways in partition $p_i$ and all the ones in partition $p_j$, so we need to analyze at most $p_i^g + p_j^g$ node to/from gateway indirect edges, and in the worst case, traffic can travel between any pair of gateways in $p_i$ and $p_j$ for a maximum number of inter-gateway indirect edges of $p_i^g \times p_j^g$. □

Lemma 8.2 provides a bound on the cost of query processing when gateway materialization has been implemented.

In general, when we need to answer a path query involving all paths between two nodes, we need to retrieve all edges between the nodes and aggregate their measures. This can be very expensive if the locations are connected by a large number of distinct paths, which is usually the case when nodes are in different partitions of the graph. By using gateway materialization, we reduce this cost significantly, as remote nodes can always be connected through a few edges to, from, and between gateways.

9 RFID CUBE

So far, we have examined the movement graph at a single level of abstraction. Since items, locations (as nodes in the graph), and the history of shipments along each edge all have associated concept hierarchies, aggregations can be performed at different levels of abstraction. We propose a data cube model for such movement graphs. The main difference between the traditional cube model and the movement graph cube is that the former aggregates on simple dimensions and levels but the latter needs to aggregate on path dimensions as well, which may involve path collapsing as a new form of generalization. In this section, we develop a model for movement graph cubes and introduce an efficient algorithm to compute them.

9.1 Movement Graph Aggregation

With a concept hierarchy associated with locations, a path can be aggregated to an abstract location by aggregating each location to a generalized location, collapsing the corresponding movements and rebuilding the movement graph according to the new path database.

Location aggregation. We can use the location concept hierarchy to aggregate particular locations inside a store to a single store location, or particular stores in a region to a single region location. We can also completely disregard certain locations not interesting for analysis, e.g., a store manager may want to eliminate all factory-related locations from the movement graph in order to see a more concise representation of the data. Fig. 3 presents a hierarchy of locations, where grayed nodes represent interesting location levels. In this example, we are interested in transportation locations at the lowest level, but all store locations are collapsed into a single node.

Fig. 4 presents a movement graph and some aggregations. All the locations are initially at the lowest abstraction level. By generalization, the transportation-related locations are collapsed into a single node Transportation, and the store-related locations into a single node Store (shown as dotted circles). Then, the original single path Factory → Dock → Hub → Backroom → Shelf is collapsed to the path Factory → Transportation → Store in the aggregated graph. If we completely remove transportation locations, we will get the path Factory → Store.

Edge aggregation semantics. From the point of view of the edge table, graph aggregation corresponds to merging of edge entries, but it is different from regular grouping of fact table entries in a data cube because collapsing paths will create edges that did not exist before, and some edges can be completely removed if they are not important for analysis. In a traditional data cube, fact table entries are never created or removed; they are just aggregated into larger or smaller groups. For example, in Fig. 4, if we
remove all transportation-related locations, a new edge (Factory, Store) will be created, with all the edges to and from transportation locations removed.

Graph aggregation involves different operations over the gid lists at each edge: when we remove nodes, we need to intersect gid lists to determine the items traveling through the new edge, but when we simply aggregate locations to higher levels of abstraction (without removing them), we need instead to compute the union of the gid lists of several edges. For example, looking at Fig. 4, in order to determine the gid list for the edge (Factory, Store) we need to intersect the gid lists of all outgoing edges from the node Factory with the incoming edges to the node Store; on the other hand, if we aggregate transportation locations to a single node, in order to determine the gid list for the edge (Transportation, Store), we need to union the gid lists of the edges (Hub, Store) and (Weighting, Store).

9.2 Cube Structure

Fact table. The fact table contains information on the movement graph and the items aggregated to the minimum level of abstraction that is interesting for analysis. Each entry in the fact table is a tuple of the form ⟨from, to, t_start, t_end, d_1, d_2, ..., d_k : gid_list : measure_list⟩, where gid_list is the list of gids that contains all the items that took the transition between the from and to locations, starting at time t_start and ending at time t_end, and that all share the dimension values d_1, ..., d_k for dimensions D_1, ..., D_k in the info table; measure_list contains a set of measures computed on the gids in gid_list.

Measure. For each entry in the fact table we register the gid_list corresponding to the tags that match the dimension values in the entry. We can also record for each gid in the list a set of measures recorded during shipping, such as average temperature, total weight, or count. We can use the gid_list to quickly retrieve those paths that match a given slice, dice, or path selection query at any level of abstraction. When a query is issued for an aggregate measure that is already precomputed in the cube, we do not need to access the path database, and all query processing can be done directly on the aggregated movement graph. For example, if we record count as a measure, any query asking for counts of items moving between locations can be answered directly by retrieving the appropriate cells in the cube. When a query asks for a measure that has not been precomputed in the cube, we can still use the aggregated cells to quickly determine the list of relevant gids and retrieve the corresponding paths to compute the measure on the fly.

RFID Cuboids. A cuboid in the RFID cube resides at a level of abstraction of the location concept hierarchy, a level of abstraction of the time dimension, and a level of abstraction of the info dimensions. Path aggregation is used to collapse uninteresting locations, and item aggregation is used to group related items. Cells in a movement graph RFID-cuboid group both items and edges that share the same values at the RFID-cuboid abstraction level.

It is possible for two separate RFID-cuboids to share a large number of common cells, namely, all those corresponding to portions of the movement graph that are common to both RFID-cuboids, and that share the same item aggregation level. A natural optimization is to compute cells only once. The size of the full movement graph cube is thus the total number of distinct cells in all the RFID-cuboids. When materializing the cube or a subset of RFID-cuboids, we compute all relevant cells to those RFID-cuboids without duplicating shared cells between RFID-cuboids.

9.3 Cube Computation

In this section, we introduce an efficient algorithm that, in a single scan of the fact table, simultaneously computes the set of interesting RFID-cuboids, as defined by the user or determined through selective cube materialization techniques such as those proposed in [20]. The computed RFID-cuboids can then be used to answer the most common OLAP and path query operations efficiently, and can also be used to quickly compute nonmaterialized RFID-cuboids on the fly.

Partition-based aggregation. We will aggregate each partition of the graph independently, i.e., paths will be divided into disjoint segments according to the partitions defined in the movement graph, and each segment in the path will be aggregated independently, without merging locations from separate segments. This technique guarantees that for any RFID-cuboid we can still use the gateway-based materialization to improve query performance. If we need to compute inter-partition aggregation, it can be done at runtime by using the best available RFID-cuboids from each partition, and computing the required aggregation on top of those.

Algorithm 2 presents an efficient cubing algorithm that does simultaneous aggregation of every interesting cell in parallel, with a single scan of the path database.

Path prefix tree. The algorithm first constructs a prefix tree for each partition of the movement graph. For this purpose, paths are divided into disjoint fragments separated by gateways; in the case when locations belonging to separate partitions appear in the same path fragment, we separate them with a virtual gateway. Each path is converted into a sequence of edges of the form (from, to, t_in, t_out); the first edge of every path has a from location equal to the special symbol ⊢, and the last edge in the path has a to location that is the special symbol ⊣. After the prefix trees have been constructed, we assign a unique gid to the items aggregated in each node of the tree that share the same values on all item dimensions. Fig. 5 presents the path prefix tree computed on the path database of Table 1, and the first partition in Fig. 1. Notice that we have created the virtual gateway G_x and added it to the prefix path.

Cuboid materialization. After building the path prefix trees, we are ready to start building the relevant set of RFID-cuboids. This is done by traversing each prefix tree, one branch at a time, and generating all possible edges from the branch, including 1) direct edges that correspond to a single node in the tree, 2) edges generated by performing path collapsing to all interesting location levels of abstraction, and 3) indirect edges from each location in the path to a gateway. All the cells involving these edges are then updated to include the gids associated with the edge and their measures adjusted. We also do aggregation of RFID-cuboids that include only item dimensions (i.e., location and time dimensions are aggregated to all) by aggregating the
info entry for the relevant gids to the levels indicated by the set of RFID-cuboids to compute. The simultaneous multilevel, multidimensional aggregation is similar to the aggregation done in multiway array aggregation [36], and we can make use of very similar techniques to minimize the amount of memory required to materialize the cube; the idea is to sort the dimensions of the info table in decreasing cardinality and to build the prefix path tree in lexicographic order, so that we minimize the number of cells that need to be kept in main memory.

Fig. 5. Prefix tree partition 1.

Algorithm 2 Graph cube construction
Input: D: path database, P: location partitions, W: set of gateways for each partition, and C = {c_1, ..., c_m}: set of interesting RFID-cuboids to materialize
Output: The cells for the RFID-cuboids in C, and a map table
Method:
1: Scan D once and build a prefix tree of edges for each partition. Each path in D is broken into partitions according to P and W; if needed, virtual gateways are inserted to separate partitions. Paths are aggregated to the minimum interesting abstraction level. Nodes in the prefix trees have the form (from, to, t_in, t_out). Prefix trees should always be rooted at the gateways.
2: for each prefix tree T_i do
3:   Assign gids to each node in the tree, by linking each gid to all of its direct children in the tree, and the complete list of ancestors
4:   for each branch p in T_i do
5:     Let p_d be the set of direct edges
6:     compute p_i, the set of indirect edges created by path collapsing according to the location abstraction level of RFID-cuboids in C
7:     compute p_g, the set of node to/from gateway edges
8:     aggregate each relevant cell using all elements in p_d, p_i, and p_g.
9:     aggregate cells involving only item dimensions
10:   end for
11: end for

Analysis. Algorithm 2 can be implemented efficiently, as it requires a single scan of the path database, which is compressed into a compact prefix tree representation, which in turn is traversed only once to generate all relevant aggregated cells in parallel. It is more efficient than the cubing algorithm proposed in [17], which computes RFID-cuboids one at a time, and thus misses the efficiency gains provided by shared computation of multiple path segments in parallel. The approach in [17] also incurs the cost of maintaining a separate info and map table for every RFID-cuboid; we instead keep a single map table for all RFID-cuboids and incorporate the item dimensions in the graph itself, so no extra overhead of accessing the info table is required when slicing on item dimensions.

10 EXPERIMENTAL EVALUATION

In this section, we report our comprehensive evaluation of the proposed model and algorithms. All the experiments were conducted on a Pentium 4 3.0 GHz, with 1.5 GB RAM, running Windows XP; the code was written in C++ and compiled with Microsoft Visual Studio 2003.

10.1 Data Synthesis

The path databases used for performance evaluation were generated using a synthetic path generator. We first generate a random movement graph with five partitions and 100 locations; each partition has a random number of gateways. Locations inside a partition are arranged according to a producer configuration, where we simulate factories connecting to intermediate locations that aggregate traffic, which in turn connect to Out-gateways; or a consumer configuration, where we simulate products moving from In-gateways, to intermediate locations such as distribution centers, and finally to stores. We generate paths by simulating groups of items moving inside a partition, or between partitions, and going usually through gateways, but sometimes also “jumping” directly between nongateway nodes; we increase shipment sizes close to gateways. We control the number of items moving together by a shipment size parameter, which indicates the smallest number of items moving together in a partition. Each item in the system has an entry in the path database and an associated set of item dimensions. We characterize a dataset by N, the number of paths, and S, the minimum shipment size.

In most of the experiments in this section, we compare three competing models. The first model is part_gw_mat; it represents a partitioned movement graph where we have performed materialization of path segments to, from, and between gateways. The second model is part_gw_no_mat; it represents a model where we partition the movement graph but we do not perform gateway materialization. The third model is no_part, which represents a movement graph that has not been partitioned and corresponds to the model introduced in [17].

10.2 Model Size

In these experiments, we compare the sizes of three models of movement graph materialization and the size of the original path database path_db. For all the experiments, we materialize the graph at the same level of abstraction as the one in the path database, and thus it is a lossless representation of the data.

Fig. 6 presents the size of the four models on path databases with a varying number of paths. For this experiment, we can clearly see that the partitioned graph without gateway materialization part_gw_no_mat is always significantly smaller than the nonpartitioned model no_part,
this comparison is fair as both models materialize only direct edges. When we perform gateway materialization, the size of the model increases, but the increase is still linear in the size of the movement graph (much better than full materialization of edges between every pair of nodes, which is quadratic in the size of the movement graph), and close to the size of the model in [17].

Fig. 7 presents the size of the map table for the partitioned part_gw_mat and nonpartitioned no_part models. The difference in size is almost a full order of magnitude. The reason is that our partition-level maps capture the semantics of collective object movements much better than [17]. This has very important implications for compression power and, more importantly, for query processing efficiency.

Fig. 8 presents the size of the fact table, as we vary the shipment size, under four different models: part_no_gw_mat, part_gw_mat, no_part, and path_db. We see that compression improves as we increase shipment sizes. Although gateway materialization increases the size of the fact table, the table is still much smaller than the original path database, except for very small shipment sizes, and it is also smaller than a nonpartitioned fact table.

Fig. 9 presents the size of the map table, as we vary shipment size, for the part_gw_mat and no_part models. This experiment isolates the effect of shipment size on the compression of the map table. We can observe a very significant reduction in map size, even for small shipment sizes. This is a clear indication that partition-level map tables provide a clear advantage over a global map table. This is important not only in terms of space, but also in terms of query processing, as we will see in the next section.

Fig. 6. Fact table size versus path_db size (S = 300).
Fig. 7. Map table size versus path_db size (S = 300).
Fig. 8. Fact table size versus shipment size (N = 108,000).
Fig. 9. Map table size versus shipment size (N = 108,000).

10.3 Query Processing

An important contribution of our model is efficiency in query processing. In these experiments, we generate 100 random path queries that ask for a measure on the path segments for items matching an item dimension condition that go from a single initial location to a single ending location and that occur within a certain time interval. We compare the partitioned movement graph with gateway materialization, part_gw_mat, against the partitioned graph without gateway materialization, part_no_gw_mat, and the nonpartitioned graph, no_part. All the queries were answered on a movement graph at the same abstraction level as the original path database. We restrict the analysis to queries with starting and ending locations in different partitions of the graph. These queries are in general more challenging to answer. Based on our experiments on single-partition queries, our method has a big advantage due to its compact map tables.

Fig. 10. Query IO versus path_db size (S = 300).
Fig. 11. Query IO versus shipment size (N = 108,000).
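
To make the query form used in these experiments concrete, the following minimal sketch evaluates one such path query naively against raw paths. The names `Stage`, `ItemPath`, `PathQuery`, the duration measure, and the sample data are illustrative assumptions, not structures from the paper; the compared models would answer the same kind of query from their fact and map tables rather than by scanning paths.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Stage:
    """One stay of an item at a location (hypothetical record layout)."""
    location: str
    t_in: int
    t_out: int


@dataclass
class ItemPath:
    dims: Dict[str, str]   # item dimensions, e.g. {"product": "milk"}
    stages: List[Stage]    # locations visited, in time order


@dataclass
class PathQuery:
    item_condition: Dict[str, str]  # required item dimension values
    start: str                      # single initial location
    end: str                        # single ending location
    window: Tuple[int, int]         # inclusive time interval (lo, hi)


def segment(path: ItemPath, query: PathQuery) -> Optional[List[Stage]]:
    """Return the stages from query.start to query.end, or None if no match."""
    if any(path.dims.get(k) != v for k, v in query.item_condition.items()):
        return None
    locs = [s.location for s in path.stages]
    if query.start not in locs:
        return None
    i = locs.index(query.start)
    if query.end not in locs[i + 1:]:
        return None
    j = locs.index(query.end, i + 1)
    seg = path.stages[i:j + 1]
    lo, hi = query.window
    if seg[0].t_in < lo or seg[-1].t_out > hi:
        return None
    return seg


def total_duration(paths: List[ItemPath], query: PathQuery) -> int:
    """Assumed measure: total time spent on all matching path segments."""
    total = 0
    for p in paths:
        seg = segment(p, query)
        if seg is not None:
            total += seg[-1].t_out - seg[0].t_in
    return total


# Made-up sample data for illustration only.
sample_paths = [
    ItemPath({"product": "milk"},
             [Stage("factory", 0, 2), Stage("dc", 5, 8), Stage("store", 10, 20)]),
    ItemPath({"product": "shoes"},
             [Stage("factory", 0, 1), Stage("store", 4, 9)]),
    ItemPath({"product": "milk"},
             [Stage("factory", 30, 31), Stage("store", 40, 50)]),
]
sample_query = PathQuery({"product": "milk"}, "factory", "store", (0, 25))
```

With this sample data, only the first path satisfies the item condition, the endpoints, and the time window, so `total_duration(sample_paths, sample_query)` sums a single segment.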
For the case of the nonpartitioned graph, we use the same query processing algorithm presented in [17]. For the case of the partitioned graph without gateway materialization, we retrieve all the relevant edges from the initial node to the gateways in its partition, edges between gateways, and edges from the gateways in the ending location’s partition to the ending location. In this case, we do not perform an intergateway join of the gid lists, but the overhead of such a join can be small if we keep an intergateway join table, or if our measure does not require matching of the relevant edges in both partitions. For the gateway materialization case, we retrieve only relevant gateway-related edges. For this method, we compute the cost using tag lists instead of gid lists on materialized edges to and from gateways.

Fig. 12. Cubing time versus path_db size (S = 300; N = 108,000).
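
As a rough illustration of the gateway materialization case, the sketch below answers a cross-partition query by intersecting the tag lists stored on materialized edges to and from gateways. Everything in it is an assumption made for illustration: the toy locations, the `materialized`, `out_gateways`, `in_gateways`, and `partition_of` tables, and the simplification that intergateway edges and time constraints are ignored. It is not the query processing algorithm of the paper or of [17].

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Edge = Tuple[str, str]

# Hypothetical materialized gateway edges: (from, to) -> tags of items
# that moved from `from` to `to` (possibly through intermediate nodes).
materialized: Dict[Edge, FrozenSet[str]] = {
    ("factory_A", "port_out"): frozenset({"t1", "t2", "t3"}),
    ("factory_B", "port_out"): frozenset({"t4"}),
    ("port_in", "store_X"): frozenset({"t1", "t4"}),
    ("port_in", "store_Y"): frozenset({"t2", "t3"}),
}

# Hypothetical partitioning: each location belongs to one partition, and
# each partition lists the gateways items leave through or arrive at.
partition_of: Dict[str, str] = {
    "factory_A": "P1", "factory_B": "P1",
    "store_X": "P2", "store_Y": "P2",
}
out_gateways: Dict[str, List[str]] = {"P1": ["port_out"]}
in_gateways: Dict[str, List[str]] = {"P2": ["port_in"]}


def cross_partition_items(start: str, end: str) -> Set[str]:
    """Tags seen both on a start->out-gateway edge and an in-gateway->end edge."""
    result: Set[str] = set()
    for g_out in out_gateways.get(partition_of[start], []):
        leaving = materialized.get((start, g_out), frozenset())
        for g_in in in_gateways.get(partition_of[end], []):
            arriving = materialized.get((g_in, end), frozenset())
            # Matching the two tag lists plays the role of the join
            # between the edges retrieved in the two partitions.
            result |= leaving & arriving
    return result
```

Only the gateway-related edges of the two partitions involved are touched, which is the intuition behind the lower retrieval cost of this method; a real implementation would also have to account for the time interval and for the measure being computed.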
In Fig. 10, we analyze query performance with respect to path database size. We see that the gateway-based materialization method is the clear winner: its cost is almost an order of magnitude smaller than that of the method proposed in [17]. We also see that our method has the lowest growth in cost with respect to database size. Fig. 11 presents the same analysis but for path databases with different minimum shipment sizes. Our model is the clear winner in all cases, and as expected, performance improves with larger shipment sizes.

10.4 Cubing

For the cubing experiments, we compute a set of five random cuboids, with significant shared dimensions among them, i.e., the cuboids share a large number of interesting locations and item dimensions. We are interested in the study of such cuboids because they capture the gains in efficiency that we would obtain if we used our algorithm to compute a full movement graph cube, as ancestor/descendant cuboids in the cube lattice benefit most from shared computation. Fig. 12 presents the total runtime to compute the five cuboids. We can see that the shared cubing algorithm significantly outperforms the level-by-level cubing algorithm presented in [17]. For the case when cuboids are very far apart in the lattice, the shared computation has a smaller effect and our algorithm performs similarly to [17].

Fig. 13 presents the total size of the cells in the five cuboids for the case of a partitioned graph without gateway materialization and a nonpartitioned graph. The compression advantage of our method increases for larger database sizes. This advantage becomes even more important as more cuboids are materialized. We can thus use our model to create compact movement graphs at different levels of abstraction and, furthermore, use them to answer queries significantly more efficiently than competing models. If we want even better query processing speed, we can sacrifice some compression and perform gateway-based materialization.

11 CONCLUSIONS

In this paper, we have introduced a new, gateway-based movement graph model for warehousing massive, transportation-based RFID data sets. This model captures the essential semantics of supply chain applications as well as many other RFID applications that explore object movements of similar nature. It provides a clean and concise representation of large RFID data sets. Moreover, it sets up a solid foundation for modeling RFID data and facilitates efficient and effective RFID data compression, data cleaning, multidimensional data aggregation, query processing, and data mining.
Fig. 13. Cube size versus path_db size (S = 300; N = 108,000).

A set of efficient methods has been developed in this study for movement graph construction, gateway identification, gateway-based graph partitioning, efficient storage structuring, multidimensional aggregation, graph cube computation, and cube-based query processing. This weaves an organized picture for systematic modeling and implementation of such an RFID data warehouse. Our implementation and performance study shows that the methods proposed here are much more efficient in storage cost, cube computation, and query processing than a previous study [17] that uses a global map table without gateway-based movement graph modeling and partitioning.

The gateway-based movement graph model proposed here captures the semantics of bulky, sophisticated, but collective object movements, including merging, shuffling, and splitting processes. Its applications are not confined to RFID data sets but can also be extended to other bulky object movement data. However, further study is needed to model and warehouse objects with scattered movements, such as traffic on highways where each vehicle moves differently from others.

REFERENCES

[1] S. Agarwal, R. Agrawal, P.M. Deshpande, A. Gupta, J.F. Naughton, R. Ramakrishnan, and S. Sarawagi, “On the Computation of Multidimensional Aggregates,” Proc. 1996 Int’l Conf. Very Large Data Bases (VLDB ’96), pp. 506-521, Sept. 1996.
[2] Application Level Events (ALE) Standard, Standard, EPCglobal, http://www.epcglobalinc.org/standards/ale, 2008.
[3] Y. Bai, F. Wang, P. Liu, C. Zaniolo, and S. Liu, “RFID Data Processing with a Data Stream Query Language,” Proc. 2007 Int’l Conf. Data Eng. (ICDE ’07), pp. 1184-1193, Apr. 2007.
[4] K. Beyer and R. Ramakrishnan, “Bottom-Up Computation of Sparse and Iceberg Cubes,” Proc. 1999 ACM Special Interest Group on Management of Data Int’l Conf. Management of Data (SIGMOD ’99), pp. 359-370, June 1999.
[5] U. Brandes, “A Faster Algorithm for Betweenness Centrality,” J. Math. Sociology, vol. 25, pp. 163-177, 2001.
[6] S. Chari, C. Jutla, J.R. Rao, and P. Rohatgi, “A Cautionary Note Regarding Evaluation of AES Candidates on Smart-Cards,” Proc. Second Advance Encryption Standard (AES) Candidate Conf., 1999.
[7] S. Chaudhuri and U. Dayal, “An Overview of Data Warehousing and OLAP Technology,” SIGMOD Record, vol. 26, pp. 65-74, 1997.
[8] Q. Chen, Z. Li, and H. Liu, “Optimizing Complex Event Processing over RFID Data Streams,” Proc. 2008 Int’l Conf. Data Eng. (ICDE ’08), pp. 1442-1444, Apr. 2008.
[9] F.R.K. Chung, Spectral Graph Theory, vol. 92. Am. Math. Soc., 1997.
[10] EPCIS standard v. 1.0.1, Standard, EPCglobal, http://www.epcglobalinc.org/standards/epcis, 2008.
[11] K. Finkenzeller, RFID-Handbook, second ed. Wiley and Sons, 2003.
[12] C. Floerkemeier and M. Lampe, “Issues with RFID Usage in Ubiquitous Computing Applications,” Pervasive Computing (PERVASIVE) Lecture Notes in Computer Science, Am. Math. Soc., 2006.
[13] L.C. Freeman, “A Set of Measures of Centrality Based on Betweenness,” Sociometry, vol. 40, pp. 35-41, 1977.
[14] H. Gobioff, S. Smith, J.D. Tygar, and B. Yee, “Smart Cards in Hostile Environments,” Proc. Second USENIX Workshop Electronic Commerce, 1996.
[15] H. Gonzalez, J. Han, and X. Li, “Flowcube: Constructing RFID Flowcubes for Multi-Dimensional Analysis of Commodity Flows,” Proc. 2006 Int’l Conf. Very Large Data Bases (VLDB ’06), Sept. 2006.
[16] H. Gonzalez, J. Han, and X. Li, “Mining Compressed Commodity Workflows from Massive RFID Data Sets,” Proc. 2006 Conf. Information and Knowledge Management (CIKM ’06), Nov. 2006.
[17] H. Gonzalez, J. Han, X. Li, and D. Klabjan, “Warehousing and Analysis of Massive RFID Data Sets,” Proc. 2006 Int’l Conf. Data Eng. (ICDE ’06), Apr. 2006.
[18] H. Gonzalez, J. Han, and X. Shen, “Cost-Conscious Cleaning of Massive RFID Data Sets,” Proc. 2007 Int’l Conf. Data Eng. (ICDE ’07), Apr. 2007.
[19] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh, “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab and Sub-Totals,” Data Mining and Knowledge Discovery, vol. 1, pp. 29-54, 1997.
[20] V. Harinarayan, A. Rajaraman, and J.D. Ullman, “Implementing Data Cubes Efficiently,” Proc. 1996 ACM Special Interest Group on Management of Data Int’l Conf. Management of Data (SIGMOD ’96), pp. 205-216, June 1996.
[21] “13.56 MHz ISM Band Class 1 Radio Frequency Identification Tag Interface Specification,” technical report, MIT Auto ID Center, 2003.
[22] S.R. Jeffery, M. Garofalakis, and M.J. Franklin, “Adaptive Cleaning for RFID Data Streams,” Proc. 2006 Int’l Conf. Very Large Data Bases (VLDB ’06), Sept. 2006.
[23] S.R. Jeffery, G. Alonso, M.J. Franklin, W. Hong, and J. Widom, “A Pipelined Framework for Online Cleaning of Sensor Data Streams,” Proc. 2006 Int’l Conf. Data Eng. (ICDE ’06), Apr. 2006.
[24] N. Khoussainova, M. Balazinska, and D. Suciu, “Peex: Extracting Probabilistic Events from RFID Data,” Proc. 2008 Int’l Conf. Data Eng. (ICDE ’08), pp. 1480-1482, Apr. 2008.
[25] C. Lee and C. Chung, “Efficient Storage Scheme and Query Processing for Supply Chain Management Using RFID,” Proc. 2008 ACM Special Interest Group on Management of Data Int’l Conf. Management of Data (SIGMOD ’08), pp. 291-302, June 2008.
[26] EPCglobal Object Name Service (ONS) 1.0.1, Standard, EPCglobal, http://www.epcglobalinc.org/standards/ons, 2008.
[27] J. Rao, S. Doraiswamy, H. Thakar, and L.S. Colby, “A Deferred Cleansing Method for RFID Data Analytics,” Proc. 2006 Int’l Conf. Very Large Data Bases (VLDB ’06), Sept. 2006.
[28] Reader Protocol (RP) Standard, Standard, EPCglobal, http://www.epcglobalinc.org/standards/rp, 2006.
[29] S. Sarma, “Integrating RFID,” ACM Queue, vol. 2, no. 7, pp. 50-57, Oct. 2004.
[30] S.E. Sarma, S.A. Weis, and D.W. Engels, “RFID Systems and Security and Privacy Implications,” Proc. Workshop Cryptographic Hardware and Embedded Systems, pp. 454-470, 2002.
[31] A. Shukla, P.M. Deshpande, and J.F. Naughton, “Materialized View Selection for Multidimensional Data Sets,” Proc. 1998 Int’l Conf. Very Large Data Bases (VLDB ’98), pp. 488-499, Aug. 1998.
[32] Tag Data Standard v. 1.4, Standard, EPCglobal, http://www.epcglobalinc.org/standards/tds/, 2008.
[33] Class 1 Generation 2 UHF Air Interface Protocol Standard Gen 2, Standard, EPCglobal, http://www.epcglobalinc.org/standards/uhfc1g2, 2007.
[34] E. Wu, Y. Diao, and S. Rizvi, “High-Performance Complex Event Processing Over Streams,” Proc. 2006 ACM Special Interest Group on Management of Data Int’l Conf. Management of Data (SIGMOD ’06), pp. 407-418, June 2006.
[35] D. Xin, J. Han, X. Li, and B.W. Wah, “Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration,” Proc. 2003 Int’l Conf. Very Large Data Bases (VLDB ’03), pp. 476-487, Sept. 2003.
[36] Y. Zhao, P.M. Deshpande, and J.F. Naughton, “An Array-Based Algorithm for Simultaneous Multidimensional Aggregates,” Proc. 1997 ACM Special Interest Group on Management of Data Int’l Conf. Management of Data (SIGMOD ’97), pp. 159-170, May 1997.

Hector Gonzalez received the PhD degree from the University of Illinois at Urbana-Champaign in 2008. Prior to the PhD degree, he completed the MBA degree from Harvard Business School in 1999. He is a research scientist working at Google Research. He conducts research on data mining, data warehousing, and information integration.

Jiawei Han is a professor in the Department of Computer Science at the University of Illinois. He has been working on research into data mining, data warehousing, stream data mining, spatiotemporal and multimedia data mining, biological data mining, social network analysis, text and Web mining, and software bug mining, with more than 400 conference and journal publications. He has chaired or served in over 100 program committees of international conferences and workshops and also served or is serving on the editorial boards for Data Mining and Knowledge Discovery, IEEE Transactions on Knowledge and Data Engineering, Journal of Computer Science and Technology, and Journal of Intelligent Information Systems. He is currently the founding editor-in-chief of ACM Transactions on Knowledge Discovery from Data (TKDD). He has received IBM Faculty Awards, the Outstanding Contribution Award at the International Conference on Data Mining (2002), ACM Service Award (1999), ACM SIGKDD Innovation Award (2004), and IEEE Computer Society Technical Achievement Award (2005). He is an ACM and IEEE fellow. His book Data Mining: Concepts and Techniques (Morgan Kaufmann) has been used worldwide as a textbook.

Hong Cheng received the PhD degree from University of Illinois at Urbana-Champaign in 2008. She is an assistant professor in the Department of Systems Engineering and Engineering Management at the Chinese University of Hong Kong. Her primary research interests include data mining, machine learning, and database systems. She has published more than 20 research papers in international conferences, journals, and book chapters, including SIGMOD, VLDB, SIGKDD, ICDE, ICDM, SDM, ACM Transactions on KDD, and Data Mining and Knowledge Discovery, and received research paper awards at ICDE ’07, SIGKDD ’06, and SIGKDD ’05.

Xiaolei Li received the BS, MS, and PhD degrees in computer science from the University of Illinois at Urbana-Champaign in May of 2002, 2004, and 2008, respectively. He is currently working for Microsoft AdCenter Labs researching a variety of topics related to advertising and other online services.

Diego Klabjan is an associate professor in the Department of Industrial Engineering and Management Sciences, Northwestern University. He received the doctorate degree from the School of Industrial and Systems Engineering of the Georgia Institute of Technology in 1999. He joined the University of Illinois at Urbana-Champaign in 1999. In 2007, he became an associate professor at Northwestern. He was the recipient of the first prize of the 2000 Transportation Science Dissertation Award and has received various other awards with graduate students. He is a former president of the Institute of Operations Research and the Management Sciences (INFORMS) Aviation Applications Section. He is an associate editor for Naval Research Logistics and two areas in Operations Research. His research is focused on transportation, supply chain management, radio frequency identification, and large-scale optimization.

Tianyi Wu received the BS degree in computer science from Fudan University, Shanghai, in 2005. He is currently working toward the PhD degree in the Department of Computer Science, University of Illinois at Urbana-Champaign. His research interests include data mining, Web search, and text data management. He is a member of the ACM.
