
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TBDATA.2017.2653815, IEEE Transactions on Big Data

An Enhanced Visualization Method to Aid Behavioral Trajectory Pattern Recognition Infrastructure for Big Longitudinal Data

Hua Fang and Zhaoyang Zhang

Abstract—Big longitudinal data provide more reliable information for decision making and are common in all kinds of fields. Trajectory pattern recognition is urgently needed to discover important structures in such data, and developing better, more computationally efficient visualization tools is crucial to guide this technique. This paper proposes an enhanced projection pursuit (EPP) method to better project and visualize the structures (e.g., clusters) of big high-dimensional (HD) longitudinal data on a lower-dimensional plane. Unlike classic PP methods potentially useful for longitudinal data, EPP is built upon nonlinear mapping algorithms to compute its stress (error) function, balancing paired weights for between- and within-structure stress while preserving original structure membership in the high-dimensional space. Specifically, EPP solves an NP-hard optimization problem by integrating gradual optimization and nonlinear mapping algorithms, and automates the search for an optimal number of iterations to display a stable structure for varying sample sizes and dimensions. Using publicized UCI and real longitudinal clinical trial datasets as well as simulation, we demonstrate that EPP visualizes big HD longitudinal data better than classic methods.

Index Terms—Enhanced projection pursuit, Pattern recognition, Visualization, Longitudinal data.

1 INTRODUCTION

Building up the infrastructure for big data visualization is a challenge but an urgent need [1], [2]. Big longitudinal data are generated every day in all kinds of fields across industry, business, government and research institutes [3]–[15]. Discovering useful information from such heterogeneous data requires trajectory pattern recognition techniques [16]–[22], and developing visualization tools to guide these techniques is crucial, since good visualization facilitates the discovery, presentation and interpretation of important structures buried in complex high-dimensional data. Projection Pursuit (PP) is a classical data visualization technique, first introduced by Friedman and Tukey in 1974 for exploratory analysis of multivariate data [23]. The basic idea of PP is to design and numerically optimize a projection index function to locate interesting projections from high- to low-dimensional space. From these interesting projections, revealed structures such as clusters can be analyzed [24]–[27]. PP is based on the assumption that redundancy exists in the data and that the major characteristics are concentrated in clusters. For example, principal components analysis is one of the typical PP methods, widely used for dimension reduction by removing uninteresting directions of variation [23], [26], [28]–[39] and now often used as an initialization step before high-dimensional data mapping and clustering [26], [40]–[45].

In the present study, our newly developed PP method is compared to two typical PP methods, Andrews Curves and Grand Tour, as all three methods are potentially useful for big longitudinal data visualization, where high dimensionality (HD) and repeated measures for each dimension are common. Section II introduces Andrews Curves and Grand Tour; Section III discusses the EPP function and algorithms; Section IV compares EPP with the other methods using real datasets; Section V evaluates EPP with simulated and artificial data; Section VI concludes this study.

• Hua Fang is the corresponding author.
• Hua Fang (E-mail: hfang2@umassd.edu) is with the Department of Computer and Information Science, Department of Mathematics, University of Massachusetts Dartmouth, 285 Old Westport Rd, Dartmouth, MA, 02747, and the Department of Quantitative Health Sciences, University of Massachusetts Medical School, Worcester, MA, 01605. Zhaoyang Zhang (E-mail: zzhang1@umassd.edu) is with the College of Engineering, University of Massachusetts Dartmouth and the Department of Quantitative Health Sciences, University of Massachusetts Medical School.

2 ANDREWS CURVES AND GRAND TOUR

Proposed in 1972, the Andrews Curve has been widely utilized in many disciplines such as biology, neurology, sociology and semiconductor manufacturing. The Andrews Curve algorithm was designed to project high-dimensional data onto a predefined Fourier series [46]; if any structures exist, they may become visible via the curves. Briefly, for each case X = {x_1, x_2, ..., x_d}, which is a vector of measurements, we define the series 1/√2, sin(s), cos(s), sin(2s), cos(2s), ...; the Andrews Curve is then calculated as

  f_x(s) = x_1/√2 + x_2 sin(s) + x_3 cos(s) + x_4 sin(2s) + ...,   (1)

for −π < s < π. Each case may be viewed as a curve between −π and π, and structures may appear as different clusters of curves. Since 1972, several variants of the Andrews Curve have been proposed. Andrews himself proposed using different integers to generalize f_x(s),

  f_x(s) = x_1 sin(n_1 s) + x_2 cos(n_1 s) + x_3 sin(n_2 s) + x_4 cos(n_2 s) + ....   (2)
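As a small illustration (ours, not part of the paper), Equation (1) can be evaluated directly; the function name and the example data below are our own choices:

```python
import numpy as np

def andrews_curve(x, s):
    """Evaluate the Andrews Curve f_x(s) of Equation (1) for one case x.

    x : 1-D array of measurements {x1, ..., xd}
    s : scalar or array of angles in (-pi, pi)
    """
    s = np.asarray(s, dtype=float)
    total = np.full_like(s, x[0] / np.sqrt(2.0))  # x1 / sqrt(2) term
    for k, xk in enumerate(x[1:], start=1):
        harmonic = (k + 1) // 2        # 1, 1, 2, 2, 3, 3, ...
        if k % 2 == 1:                 # x2 sin(s), x4 sin(2s), ...
            total = total + xk * np.sin(harmonic * s)
        else:                          # x3 cos(s), x5 cos(2s), ...
            total = total + xk * np.cos(harmonic * s)
    return total

# Each case becomes one curve over (-pi, pi); clusters of cases
# appear as bundles of similar curves.
grid = np.linspace(-np.pi, np.pi, 200)
curve = andrews_curve(np.array([1.0, 2.0, 3.0]), grid)
```

Plotting one such curve per case, as done for the "Andrews Curve" panels later in the paper, makes cluster structure visible (or not) to the eye.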

2332-7790 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

By testing n_1 = 2, n_2 = 4, n_3 = 8, ..., the author concluded that Equation (2) is more space-filling (i.e., a curve whose range contains the entire 2-dimensional unit square, or whose mapping is continuous) than Equation (1) but more difficult to interpret when used for visual inspection [46]. A three-dimensional Andrews plot was suggested by Khattree and Naik [47],

  √2 f_x(s) = x_1 + x_2[sin(s) + cos(s)] + x_3[sin(s) − cos(s)] + x_4[sin(2s) + cos(2s)] + ....   (3)

As every projection point is exposed to both a sine and a cosine function, the advantage of Equation (3) is that the trigonometric terms do not simultaneously vanish at any given s, which establishes an interesting relation between the Andrews Curve and the eigenvectors of a symmetric positive definite circular covariance matrix.

Different from the Andrews Curve, the Grand Tour proposed by Asimov [48] and Buja [49] in 1985 is an interactive visualization technique. The basic idea is to rotate the projection plane through all angles and search for interesting structures [50]–[56]. However, these methods were not ideal: they require intensive computation and computer storage, and projection recovery turns out to be difficult. Motivated by the Andrews Curve, Wegman and Shen [57] suggested an algorithm for computing an approximate two-dimensional grand tour, called the pseudo grand tour because the tour does not visit all possible orientations of a projection plane. The method has recognized advantages, such as easy calculation, time efficiency in visiting regions with different plane orientations, and easy recovery of the projection. Briefly, assuming d is an even number without loss of generality [57], let a_1(s) be

  √(2/d) (sin(λ_1 s), cos(λ_1 s), ..., sin(λ_{d/2} s), cos(λ_{d/2} s)),   (4)

and a_2(s) be

  √(2/d) (cos(λ_1 s), −sin(λ_1 s), ..., cos(λ_{d/2} s), −sin(λ_{d/2} s)),   (5)

where the λ_i are irrational values, linearly independent over the rationals. a_1(s) and a_2(s) have the following properties:

  ‖a_1(s)‖²₂ = (2/d) Σ_{j=1}^{d/2} [sin²(λ_j s) + cos²(λ_j s)] = 1,
  ‖a_2(s)‖²₂ = (2/d) Σ_{j=1}^{d/2} [cos²(λ_j s) + (−sin(λ_j s))²] = 1,   (6)

and

  ⟨a_1(s), a_2(s)⟩ = (2/d) Σ_{j=1}^{d/2} [sin(λ_j s) cos(λ_j s) − cos(λ_j s) sin(λ_j s)] = 0,   (7)

where ⟨·,·⟩ is the inner product of the two vectors a_1(s) and a_2(s). Then the projections of the data points on the plane formed by the two basis vectors are

  f_{x_i}(s) = (X′_{i1}, X′_{i2}),  i = 1, 2, ..., N,   (8)

in which

  X′_{i1} = Σ_{k=1}^{d} x_k a_{1k},
  X′_{i2} = Σ_{k=1}^{d} x_k a_{2k}.   (9)

TABLE 1: Notations

Symbols          Definitions
X                A vector of measurements
X_i, X_j         The i-th and j-th cases
X′_i, X′_j       The projections in a 2D space
s                Angle, 0 < s < π
λ                Irrational values, linearly independent over the rationals
a_1(s), a_2(s)   Orthonormal basis for a 2D plane
N                Number of cases
T                Sample times
d                Number of dimensions
p                Number of components
D_ij             Distance between X′_i and X′_j
D*_ij            Distance between X_i and X_j
S                Stress
c_i              Cluster label of case i
k                The optimal number of clusters
D̄                Average distance
ℓ                Total data size
α                Weight of the within-cluster stress
β                Weight of the between-cluster stress
S_EPP            Total EPP stress
f_x              Low-dimensional projections of data
ℓ                Size of the simulated data

According to (6), a_1(s) and a_2(s) form an orthonormal basis for a two-dimensional plane. Because of the dependence between sin(·) and cos(·), this two-dimensional plane is not quite space-filling. However, the algorithm based on (8) is computationally convenient: by taking the inner product as in (7), an [a_1(s), a_2(s)] plane is constructed onto which the high-dimensional data are projected.

Different from the Andrews Curve and the Pseudo Grand Tour, our new enhanced projection pursuit (EPP) method is built upon Sammon Mapping, assuming that not all big longitudinal data fit trigonometric functions or transformations. Sammon mapping, proposed by Sammon in 1969 [60], has been one of the most successful nonlinear multidimensional scaling methods [58], [59]. It is highly effective and robust for hyper-spherical and hyper-ellipsoidal clusters [60]. The idea is to minimize the error (called "stress") between the distances of projected points and the distances of the original data points by moving the projected points around in a lower-dimensional space (mostly a 2-dimensional plane) to best represent the distances in the high-dimensional space. Since its advent, much effort has concentrated on improving the optimization algorithm [61]–[65] but rarely on modifying Sammon's stress function [64].

Our proposed EPP modifies the Sammon stress function by balancing two weights, for the between- and within-cluster errors respectively, in order to better segment and visualize structures (e.g., clusters) on a projected two-dimensional


plane while preserving their cluster membership in high-dimensional space. To this end, we developed a nonlinear algorithm to compute the EPP stress. In addition, our EPP automates the search for the optimal number of iterations needed to display a stable structure for varying sample sizes and dimensions. Our goal is to aid the trajectory pattern recognition of longitudinal data. To evaluate the performance of EPP, one big publicized dataset and two real longitudinal randomized controlled trial (RCT) datasets, including a large web-delivered trial, were used to compare EPP with the Andrews Curve and the Pseudo Grand Tour. Simulated big longitudinal datasets based on the RCT data parameters were used to evaluate EPP performance under varying conditions.

3 ENHANCED PROJECTION PURSUIT (EPP)

In longitudinal data analyses, repeated measures for each dimension result in inevitable high dimensionality. Built upon Sammon Mapping [60], we propose an Enhanced Projection Pursuit (EPP) method in which the Sammon stress becomes a special case of the EPP stress when there is only one cluster and the weights of the within- and between-cluster stresses are equal. EPP is used to aid trajectory pattern recognition for such longitudinal data. The key idea of EPP is to balance the weights of between- and within-cluster variations in order to achieve better visualization, and thus aid pattern recognition for high-dimensional (HD) longitudinal data. Table 1 summarizes the notations used hereafter. First, we define our data size and high-dimensional space.

Definition 1. Let N be the number of cases (e.g., subjects, data points, etc.), let X_i, 1 ≤ i ≤ N, be a vector of d variables {x_1, x_2, ..., x_d}, and let each X_i be repeatedly measured t times. Then the data lie in a dt-dimensional space and the entire data size is ℓ = N·d·t. That is, with N cases, X_i is a dt-dimensional vector {x_11, x_12, ..., x_1t, x_21, x_22, ..., x_2t, ..., x_d1, x_d2, ..., x_dt}.

Then the projection of the big longitudinal data from the high-dimensional space onto a two-dimensional plane is defined as follows:

Definition 2. To project big HD longitudinal data onto a two-dimensional plane, and similar to [60], let the distance between any two vectors X_i and X_j in the dt-dimensional space be defined by D*_ij = ‖X_i − X_j‖₂, where ‖·‖₂ is the Euclidean norm.

Based on Definitions 1 and 2, randomly choose an initial two-dimensional configuration for the N vectors X′ and compute all the two-dimensional distances D_ij, 1 ≤ i, j ≤ N, i ≠ j. The Sammon stress [60] is calculated as

  S_sam = (1 / Σ_{i<j} D*_ij) · Σ_{i<j} (D_ij − D*_ij)² / D*_ij.   (10)

Different from Equation (10), the EPP stress function S_EPP is expressed as the weighted sum of the within-cluster stress S_EPPw and the between-cluster stress S_EPPb,

  S_EPP = α·S_EPPw + β·S_EPPb,   (11)

in which

  S_EPPw = (1 / Σ_{i<j} D*_ij) · Σ_{i<j, c_i=c_j} (D_ij − Dw_ij)² / D*_ij,
  S_EPPb = (1 / Σ_{i<j} D*_ij) · Σ_{i<j, c_i≠c_j} (D_ij − Db_ij)² / D*_ij,   (12)

where 1 / Σ_{i<j} D*_ij is a constant for a given big HD longitudinal dataset, the two sums are the within-cluster and between-cluster stresses, respectively, Dw_ij is the within-cluster Euclidean distance between cases i and j if they are in the same cluster, and Db_ij is the between-cluster Euclidean distance between cases i and j if they belong to different clusters; α and β are the weights of the within-cluster and between-cluster stresses, respectively, with α, β > 0 and α + β = 1. Note again that the Sammon stress is a special case of the EPP stress when there is only one cluster, c_i = 1, i = 1, 2, ..., N, and the weights of the within- and between-cluster stresses are equal, α = β.

The EPP algorithm aims to obtain an interesting two-dimensional projection of the original high-dimensional data that minimizes its stress function. The optimization problem is expressed as

  minimize   α·S_EPPw + β·S_EPPb
  subject to α, β > 0, α + β = 1.   (13)

Algorithm 1(a): Main EPP Algorithm
Input: longitudinal data X_i, i = 1, 2, ..., N; cluster labels c_i, 1 ≤ i ≤ N; a range of stress error bounds ε; maximum iteration number l_max; weight change step δ
Output: α, β, f_x and S_EPP
 1: Initialize X′ by PCA
 2: Set initial values S_EPP,0 → ∞, l = 0, m = 0, α_0 and β_0 (α_0, β_0 > 0, α_0 + β_0 = 1)
 3: for l = 0 to l_max do
 4:   f_x^l = argmin_{f_x} S_EPP(α_l, β_l, f_x)
 5:   S_EPP^l = S_EPP(α_{l+1}, β_{l+1}, f_x^l)
 6:   while α_l, β_l > 0 and α_l + β_l = 1 do
 7:     if S_EPP(α_l + δ, β_l − δ, f_x^l) < S_EPP(α_l, β_l, f_x^l) then
 8:       α_{l+1} = α_{l+1} + δ, β_{l+1} = β_{l+1} − δ
 9:     else
10:       if S_EPP(α_l − δ, β_l + δ, f_x^l) < S_EPP(α_l, β_l, f_x^l) then
11:         α_{l+1} = α_{l+1} − δ, β_{l+1} = β_{l+1} + δ
12:       else
13:         break
14:       end if
15:     end if
16:   end while
17:   if |S_EPP^l − S_EPP^{l−1}| ≤ ε then
18:     break
19:   end if
20: end for

Definition 3. To minimize S_EPP(α, β, f_x), where f_x stands


for the projections corresponding to Dw_ij and Db_ij, the gradual approximation algorithm works as follows: given a fixed pair of α and β, update the values of f_x where S_EPP attains its minimum; that is, keep updating α and β until there are no further changes according to (12):

  α = α + δ, β = β − δ,  if S_EPP(α + δ, β − δ, f_x) < S_EPP,
  α = α − δ, β = β + δ,  if S_EPP(α − δ, β + δ, f_x) < S_EPP,   (14)
  α = α,     β = β,      otherwise.

The main EPP algorithm is shown in Algorithm 1(a). The embedded gradual approximation algorithm, displayed in Algorithm 1(b), minimizes S_EPP given α and β; the values of f_x are retained when S_EPP attains its minimum. Specifically, the EPP algorithm initializes X′ from the results of PCA, updates f_x according to Algorithm 1(b) based on Equation (15), and then calculates the EPP stress and updates α and β with a weight change step δ based on Equation (14). If the difference between two consecutive stress values is less than the threshold ε, the algorithm stops; otherwise the process repeats until the maximum iteration number l_max is reached.

  f_x^l = argmin_{f_x} S_EPP(α_l, β_l, f_x).   (15)

Algorithm 1(b): Algorithm for Updating f_x
Input: projections X′, α and β, error bound ε, maximum iteration number m_max, S_EPP^(0) → ∞
Output: S_EPP^(m+1) and f_x^(m+1)
1: for m = 0 to m_max do
2:   f_x^(m+1) = f_x^(m) − τ · Δ(m)
3:   S_EPP^(m+1) = S_EPP(α, β, f_x^(m+1))
4:   if |S_EPP^(m+1) − S_EPP^(m)| ≤ ε then
5:     break
6:   end if
7: end for

Note that in Algorithm 1(b), when updating f_x, f_x^(m) denotes the projections of the data on the two-dimensional space at the m-th iteration; τ is the iteration step size, set at 0.3 or 0.4 according to [60]; Δ(m) = (∂S_EPP^(m)/∂f_x^(m)) / |∂²S_EPP^(m)/∂(f_x^(m))²|; and w = −2 / Σ_{i<j} D*_ij is a constant. The first-order derivative with respect to f_x is shown in Equation (16) and the second-order derivative in Equation (17). Unlike the nonlinear mapping algorithm [60], the EPP algorithm further automates the search for the optimal number of iterations needed to display a stable structure, by learning the change of S_EPP over two consecutive iterations across a range of error bounds, sample sizes and numbers of dimensions.

4 EPP PERFORMANCE IN CASE STUDIES

Our EPP method was tested on 3 real datasets: one publicized dataset [66] and two randomized controlled trial (RCT) datasets [43], [67]–[69]. Their features are summarized in Table 2.

TABLE 2: Real Data Description

Name                 Waveform   TDTA    QuitPrimo
Cases (N)            5000       109     1320
Components (p)       21         5       3
Time points (t)      1          4       6
Total data size (ℓ)  105,000    2,180   23,760
Clusters (c)         3          3       4

The Waveform data were generated by a clustering data generator described in [70] and published by [66], [70]. The dataset consists of 5000 cases, each with 21 attributes (ℓ = 105,000), and contains 3 identified clusters of waves for testing algorithms. Figure 1 shows the performance of the three PP methods on the Waveform dataset. Clearly, the Andrews Curve and the grand tour were unable to visualize the three classes, while EPP demonstrated its projection power in visualizing the 3-cluster structure.

TABLE 3: Mean values of TDTA Data

                   t1     t2     t3     t4
Benefits     C1    133    128    127    127
             C2    138    127    133    134
             C3    113    112    115    112
Family Norm  C1    116    116    113    111
             C2    116    115    114    115
             C3    101    102    100    99

TABLE 4: Standard Deviation of TDTA Data

                   t1     t2     t3     t4
Benefits     C1    13.89  21.15  17.35  22.60
             C2    14.88  25.80  16.21  14.54
             C3    26.38  16.11  16.59  19.95
Family Norm  C1    9.94   7.19   9.88   12.20
             C2    7.22   7.98   9.14   9.63
             C3    12.96  10.81  12.17  9.47

The TDTA data were collected from a longitudinal, culturally-tailored smoking cessation intervention for 109 Asian American smokers (ℓ = 2,180) and contain three identified culturally-adaptive response patterns [43]. The intervention used three components: cognitive behavioral therapy, cultural tailoring, and nicotine replacement therapy. The first two were measured by scores on the Perceived Risks and Benefits, Family and Peer Norms, and Self-efficacy scales. Each scale has four repeated measures, 20 attributes in total, of which only Perceived Benefits and Family Norms were used, based on our multiple-imputation-based fuzzy clustering method discussed elsewhere [71]–[73]. As shown in Figure 2, two of the three clusters projected by the Andrews Curve overlapped completely, while the Grand Tour seems to perform as well as EPP for this longitudinal dataset. The parameters of the TDTA data are shown in Tables 3 and 4.

The QuitPrimo dataset includes 1320 cases (ℓ = 23,760) with about 8.4% missing values. This study aims to evaluate an


  ∂S_EPP^(m) / ∂f_x^(m) =
    w · Σ_{j=1, j≠p}^{N} [(D*_j − Dw_j) / (D*_j · Dw_j)] · (f_x^(m) − X′_j^(m)),   if c_p = c_j,
    w · Σ_{j=1, j≠p}^{N} [(D*_j − Db_j) / (D*_j · Db_j)] · (f_x^(m) − X′_j^(m)),   if c_p ≠ c_j.   (16)

  ∂²S_EPP^(m) / ∂(f_x^(m))² =
    w · Σ_{j=1, j≠p}^{N} (1 / (D*_j · Dw_j)) · [(D*_j − Dw_j) − ((f_x^(m) − X′_j^(m))² / Dw_j) · (1 + (D*_j − Dw_j)/Dw_j)],   if c_p = c_j,
    w · Σ_{j=1, j≠p}^{N} (1 / (D*_j · Db_j)) · [(D*_j − Db_j) − ((f_x^(m) − X′_j^(m))² / Db_j) · (1 + (D*_j − Db_j)/Db_j)],   if c_p ≠ c_j.   (17)
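As a hedged numerical sketch (ours, not the paper's implementation), the weighted stress of Equations (11)–(12) and one projection update in the spirit of Algorithm 1(b) can be written as follows. We estimate the gradient by central finite differences rather than the closed-form derivatives of Equations (16)–(17), and take a plain gradient step instead of the exact Newton-style Δ(m); all function and variable names are ours:

```python
import numpy as np

def epp_stress(X, P, labels, alpha):
    """Weighted within/between-cluster stress of Eqs. (11)-(12).

    X      : (N, dt) original high-dimensional cases
    P      : (N, 2)  current 2-D projections f_x
    labels : (N,)    cluster labels c_i
    alpha  : weight of the within-cluster stress (beta = 1 - alpha)
    """
    s_within = s_between = norm = 0.0
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            d_star = np.linalg.norm(X[i] - X[j])  # D*_ij in the HD space
            d_proj = np.linalg.norm(P[i] - P[j])  # D_ij on the 2-D plane
            norm += d_star
            term = (d_proj - d_star) ** 2 / d_star
            if labels[i] == labels[j]:
                s_within += term                   # within-cluster stress
            else:
                s_between += term                  # between-cluster stress
    return (alpha * s_within + (1 - alpha) * s_between) / norm

def update_projection(X, P, labels, alpha, tau=0.3, h=1e-5):
    """One f_x update, f_x <- f_x - tau * gradient, with the gradient of
    the stress estimated by central finite differences."""
    grad = np.zeros_like(P)
    for idx in np.ndindex(*P.shape):
        Pp, Pm = P.copy(), P.copy()
        Pp[idx] += h
        Pm[idx] -= h
        grad[idx] = (epp_stress(X, Pp, labels, alpha)
                     - epp_stress(X, Pm, labels, alpha)) / (2 * h)
    return P - tau * grad
```

Consistent with the paper's note, with a single cluster and α = β this stress reduces (up to the constant weight) to the Sammon stress of Equation (10).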
[Figure omitted: three scatter plots of the projected Waveform data; panels (a) Andrews Curve, (b) Grand tour, (c) EPP]

Fig. 1: Projection Pursuit of Waveform data using Andrews Curve, grand tour and proposed EPP

[Figure omitted: three scatter plots of the projected TDTA data; panels (a) Andrews Curve, (b) Grand tour, (c) EPP]

Fig. 2: Projection Pursuit of TDTA data using Andrews Curve, grand tour and proposed EPP

integrated informatics solution to increase access to web-delivered smoking cessation support. The data were collected via an online referral portal and cover three components: 1) My Mail (MM), 2) Online Community (OC), and 3) Our Advice (OA), each with 6 monthly values measured over 6 months. Figure 3 again showcases the strength of EPP over the other two methods for this big longitudinal dataset: the four projected patterns overlapped using the Andrews Curve, and the blue and green patterns overlapped to a noticeable degree using the Grand Tour. Tables 5 and 6 show the mean values and standard deviations of the QuitPrimo dataset, respectively.

TABLE 5: Mean values of QuitPrimo Data

         t1     t2     t3     t4     t5     t6
MM  C1   0.747  0.154  0.017  0.025  0.006  0.000
    C2   1.091  0.465  0.139  0.080  0.139  0.043
    C3   0.047  0.000  0.000  0.000  0.000  0.000
    C4   0.659  0.157  0.003  0.000  0.000  0.000
OA  C1   5.708  8.601  8.736  6.902  3.997  3.638
    C2   5.708  8.601  8.736  6.902  3.997  3.638
    C3   0.888  0.100  0.000  0.000  0.000  0.000
    C4   6.345  8.686  5.857  1.213  0.007  0.000
OC  C1   0.284  0.020  0.006  0.006  0.003  0.006
    C2   0.455  0.080  0.011  0.021  0.021  0.000
    C3   0.006  0.000  0.000  0.000  0.000  0.000
    C4   0.275  0.031  0.014  0.000  0.000  0.000

The optimal pair (α, β) for the included real longitudinal datasets, TDTA and QuitPrimo, given f_x can be detected by the following steps. Initialize a pair of values, e.g., (0.5, 0.5), and calculate the stress of the proposed EPP method by Equations (10) and (11). Increase α and decrease β, or vice versa, by a boundary parameter δ, e.g., δ = 0.1, to obtain a new stress value. Updating α and β until the stress values no longer decrease, we can obtain the optimal


[Figure omitted: three scatter plots of the projected QuitPrimo data; panels (a) Andrews Curve, (b) Grand tour, (c) EPP]

Fig. 3: Projection Pursuit of QuitPrimo data using Andrews Curve, grand tour and proposed EPP

[Figure omitted: stress versus α curves for EPP, with the Sammon stress as a reference line; panels (a) TDTA dataset, (b) QuitPrimo dataset]

Fig. 4: Finding an optimal pair of weights that balance the between and within stresses for TDTA and QuitPrimo using EPP (blue line is reference line from Sammon's Stress)
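The optimal-weight search just described, whose stress curves Figure 4 visualizes, can be sketched as a simple hill climb (our illustration; `stress_fn` stands in for S_EPP(α, 1 − α, f_x) at fixed projections f_x):

```python
def find_optimal_weights(stress_fn, alpha0=0.5, delta=0.1):
    """Hill-climbing search for (alpha, beta) with alpha + beta = 1.

    stress_fn(alpha) should return S_EPP(alpha, 1 - alpha, f_x) for the
    current projections; weight is shifted by delta while the stress drops,
    mirroring the update rule of Eq. (14).
    """
    alpha = alpha0
    best = stress_fn(alpha)
    while True:
        moved = False
        for step in (delta, -delta):      # try alpha + delta, then alpha - delta
            cand = alpha + step
            if 0.0 < cand < 1.0 and stress_fn(cand) < best:
                alpha, best = cand, stress_fn(cand)
                moved = True
                break
        if not moved:                     # no direction improves: stop
            return alpha, 1.0 - alpha

# Toy stress surface minimized at alpha = 0.8, the value the paper
# reports for the TDTA and QuitPrimo data.
alpha, beta = find_optimal_weights(lambda a: (a - 0.8) ** 2)
```

Because each move requires a strict decrease in stress, the loop always terminates; in the full algorithm `stress_fn` would re-evaluate the EPP stress of Equation (11) at the current projections.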

TABLE 6: Standard Deviation of QuitPrimo Data

         t1     t2     t3     t4     t5     t6
MM  C1   1.718  1.124  0.237  0.339  0.106  0.000
    C2   1.595  2.437  0.979  0.732  1.079  0.462
    C3   0.384  0.000  0.000  0.000  0.000  0.000
    C4   1.972  1.246  0.059  0.000  0.000  0.000
OA  C1   1.972  1.246  0.059  0.000  0.000  0.000
    C2   2.431  0.875  0.893  1.394  1.172  1.484
    C3   2.249  0.457  0.000  0.000  0.000  0.000
    C4   2.490  1.067  3.384  1.797  0.083  0.000
OC  C1   2.490  1.067  3.384  1.797  0.083  0.000
    C2   0.996  0.463  0.103  0.178  0.206  0.000
    C3   0.078  0.000  0.000  0.000  0.000  0.000
    C4   0.783  0.194  0.186  0.000  0.000  0.000

TABLE 7: Cluster Information for TDTA and QuitPrimo

Cluster                    1      2      3      4
TDTA       # of cases      50     31     16     -
           proportions     0.52   0.32   0.16   -
QuitPrimo  # of cases      356    187    490    287
           proportions     0.27   0.14   0.37   0.22

weights α and β for the within- and between-cluster stresses. As shown in Figures 4(a) and 4(b), the optimal weights of (0.8, 0.2) were found for both the TDTA and QuitPrimo data.

5 EPP PERFORMANCE USING SIMULATED LONGITUDINAL DATA

The proposed EPP was also evaluated using simulated data. First, simulated longitudinal data were generated using parameters from the two real datasets, TDTA and QuitPrimo. The data generation procedure is as follows:

1) Fit the multivariate normal distribution to the TDTA data and the zero-inflated Poisson mixture distribution to the QuitPrimo web trial data [71], respectively, and learn parameters such as the cluster mean vectors and standard deviations; the results are shown in Tables 3, 4, 5 and 6;
2) Set the number of cases in each cluster according to the proportion of each cluster (Table 7);
3) Generate data for each cluster based on the model parameters from (1) and the cluster sizes from (2);
4) Randomize the data from (3) to generate a complete dataset;
5) Repeat (1)-(4) to generate datasets with varying sample sizes, N ∈ {100, 200, 300, 500, 1000, 5000}, d_TDTA = 20, d_QuitPrimo = 18, and ℓ_TDTA =


{2000, 4000, 6000, 10000, 20000, 100000}, and ℓ_QuitPrimo = {1800, 3600, 4800, 9000, 18000, 36000}.

Figure 5 displays the EPP projection based on the TDTA parameters using different sample sizes. From N = 100 to N = 5000, the clusters are clearly projected. With smaller sample sizes, the data points are more spread within each cluster. The red and green clusters are closer to each other than to the blue cluster.

Based on the QuitPrimo parameters, EPP again clearly projected the four clusters across a range of data sizes ℓ. The blue cluster is always far apart from the red cluster; the other three clusters always touch each other, as shown in Figure 6.

Using the same simulated datasets, the optimal number of iterations was tested for the proposed EPP method using different sample sizes and dimensions. In Figure 7(a), the number of dimensions was fixed at 20 and the data sizes ℓ were varied from 2,000 to 100,000. In Figure 7(b), the data size ℓ was fixed at 100,000 and the number of dimensions d was varied from 2 to 100. For all conditions, the change between iterations (ε) was varied from 10⁻³ to 10⁻⁶. The findings indicate that across different sample sizes, dimensions, and changes of stress between iterations (ε), the optimal number of iterations seems to be always below 350.

Furthermore, using the same data generation procedure, an artificial longitudinal dataset was generated with standardized mean and variance-covariance matrices to evaluate EPP performance. The mean vectors were set to 0.2, 0.5 and 0.8 for three clusters [74], [75]; the correlation matrix (standardized variance-covariance matrix) was set with 1 on the diagonal, and the other matrix elements were randomly selected from {0.1, 0.3, 0.5} [74], [75]. The data size was varied from 1,000 to 500,000 and the number of dimensions from 10 to 100. The different colored planes stand for the four settings of the change of stress between iterations (ε): 10⁻³, 10⁻⁴, 10⁻⁵ and 10⁻⁶. As shown in Figure 8, the optimal number of iterations seems to be always below 500 across different sample sizes, dimensions and error bounds (ε), so using 500 iterations could be an empirical rule for setting the number of iterations for EPP. Overall, in terms of computational time, EPP took 11 and 22 seconds to project the real TDTA and QuitPrimo data, respectively, and up to 9 minutes in the worst scenario of N = 20,000 and dt = 100.

6 CONCLUSION

Pattern visualization is a challenging field, and a robust projection pursuit method could enormously ease pattern recognition. Our enhanced projection pursuit (EPP), a variant of the classic Sammon Mapping, balances the weights of between- and within-cluster variations and better projects big high-dimensional longitudinal data onto a two-dimensional plane using nonlinear mapping algorithms. Compared to the classical Andrews Curve and Grand Tour, our EPP method seems to perform consistently well and was more robust for such data. Different from those two methods, EPP is not built upon trigonometric functions, as not all longitudinal datasets follow this assumption, especially longitudinal randomized controlled trial (RCT) or observational data [40]–[45], [67], [74], [76]. Using the publicized UCI dataset, real longitudinal RCT datasets and a number of simulated big longitudinal datasets, EPP showcases its clear and better projection power with respect to high dimensionality, sample sizes and error bounds for the change between iterations, with satisfactory computational costs. Embedding EPP into different trajectory pattern recognition systems and further reducing computational time for bigger data would be future tasks. Testing EPP on more big longitudinal data could further warrant its robustness.

ACKNOWLEDGMENT

This project was supported by National Institutes of Health (NIH) grant 1R01DA033323-01 and an NIH National Center for Advancing Translational Sciences 5UL1TR000161-04 pilot study award to Dr. Fang.

REFERENCES

[1] H. Fang, Z. Zhang, C. J. Wang, M. Daneshmand, C. Wang, and H. Wang, "A survey of big data research," IEEE Network, vol. 29, no. 5, p. 6, 2015.
[2] P. Fox and J. Hendler, "Changing the equation on scientific data visualization," Science, vol. 331, no. 6018, pp. 705–708, 2011.
[3] M. Kumagai, J. Kim, R. Itoh, and T. Itoh, "TASUKE: a web-based visualization program for large-scale resequencing data," Bioinformatics, vol. 29, no. 14, pp. 1806–1808, 2013.
[4] M. Keller, J. Beutel, O. Saukh, and L. Thiele, "Visualizing large sensor network data sets in space and time with vizzly," in Local Computer Networks Workshops (LCN Workshops), 2012 IEEE 37th Conference on. IEEE, 2012, pp. 925–933.
[5] D. E. Pires, R. C. de Melo-Minardi, C. H. da Silveira, F. F. Campos, and W. Meira, "aCSM: noise-free graph-based signatures to large-scale receptor-based ligand prediction," Bioinformatics, vol. 29, no. 7, pp. 855–861, 2013.
[6] O. Morozova and M. A. Marra, "Applications of next-generation sequencing technologies in functional genomics," Genomics, vol. 92, no. 5, pp. 255–264, 2008.
[7] D. P. Bartel, "MicroRNAs: genomics, biogenesis, mechanism, and function," Cell, vol. 116, no. 2, pp. 281–297, 2004.
[8] C. Lynch, "Big data: How do your data grow?" Nature, vol. 455, no. 7209, pp. 28–29, 2008.
[9] B. H. Brinkmann, M. R. Bower, K. A. Stengel, G. A. Worrell, and M. Stead, "Large-scale electrophysiology: acquisition, compression, encryption, and storage of big data," Journal of Neuroscience Methods, vol. 180, no. 1, pp. 185–192, 2009.
[10] F. Frankel and R. Reid, "Big data: Distilling meaning from data," Nature, vol. 455, no. 7209, p. 30, 2008.
[11] M. Waldrop, "Big data: Wikiomics," Nature News, vol. 455, no. 7209, pp. 22–25, 2008.
[12] A. McAfee, E. Brynjolfsson, T. H. Davenport, D. Patil, and D. Barton, "Big data: The management revolution," Harvard Business Review, vol. 90, no. 10, pp. 61–67, 2012.
[13] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A distributed storage system for structured data," ACM Transactions on Computer Systems (TOCS), vol. 26, no. 2, p. 4, 2008.
[14] W. Tan, M. B. Blake, I. Saleh, and S. Dustdar, "Social-network-sourced big data analytics," IEEE Internet Computing, no. 5, pp. 62–69, 2013.
[15] D. Tracey and C. Sreenan, "A holistic architecture for the internet of things, sensing services and big data," in Cluster, Cloud and Grid Computing (CCGrid), 2013 13th IEEE/ACM International Symposium on. IEEE, 2013, pp. 546–553.
[16] H. Fang, H. Wang, C. Wang, and M. Daneshmand, "Using probabilistic approach to joint clustering and statistical inference: Analytics for big investment data," in Big Data (Big Data), 2015 IEEE International Conference on. IEEE, 2015, pp. 2916–2918.
2332-7790 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
(Figure: six 2-D scatter panels of EPP projections, one per data size: 2000, 4000, 6000, 10000, 20000, and 100000.)

Fig. 5: EPP for simulated longitudinal data using TDTA parameters, with data size from 2000 to 100000
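The projections in Figures 5 and 6 are produced by iteratively minimizing a Sammon-type stress until the change between iterations falls below a bound ε, capped at 500 iterations (the empirical rule suggested by Figures 7 and 8). As a simplified, illustrative stand-in, the sketch below uses the classic unweighted Sammon stress [60] rather than EPP's weighted between/within-cluster variant; the function names, gradient-descent scheme, and parameter defaults here are ours, not the paper's.

```python
import numpy as np

def _pairwise(Z):
    """Full matrix of pairwise Euclidean distances. The diagonal is set
    to 1 as a harmless placeholder; it is masked out wherever used."""
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    np.fill_diagonal(d, 1.0)
    return d

def sammon(X, epsilon=1e-5, max_iter=500, step=0.3, seed=0):
    """Project X to 2-D by gradient descent on the Sammon stress,
    stopping when the between-iteration change in stress falls below
    epsilon, or after max_iter iterations (500 being the empirical
    bound suggested by Figures 7-8). Returns the embedding and the
    stress history; history[0] is the stress of the random start."""
    rng = np.random.default_rng(seed)
    D = _pairwise(X)            # high-dimensional distances (assumed nonzero)
    c = D.sum()                 # Sammon normalizing constant

    def stress(Y):
        E = (D - _pairwise(Y)) ** 2 / D
        np.fill_diagonal(E, 0.0)
        return E.sum() / c

    Y = rng.normal(scale=1e-2, size=(len(X), 2))
    history = [stress(Y)]
    for _ in range(max_iter):
        d = _pairwise(Y)
        W = (D - d) / (D * d)
        np.fill_diagonal(W, 0.0)
        # dE/dY_i = (-2/c) * sum_j W_ij * (Y_i - Y_j)
        grad = (-2.0 / c) * np.einsum('ij,ijk->ik',
                                      W, Y[:, None, :] - Y[None, :, :])
        # Backtracking: shrink the step until the stress does not increase.
        while True:
            cand = Y - step * grad / (np.abs(grad).max() + 1e-12)
            cur = stress(cand)
            if cur <= history[-1] or step < 1e-12:
                break
            step *= 0.5
        Y = cand
        history.append(cur)
        if history[-2] - history[-1] < epsilon:   # epsilon stopping rule
            break
    return Y, history
```

On small clustered inputs the returned stress history is non-increasing and the loop usually terminates well before the 500-iteration cap, consistent with the empirical rule reported in the text.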

[17] Z. Zhang, H. Fang, and H. Wang, “Visualization aided engagement pattern validation for big longitudinal web behavior intervention data,” IEEE 17th International Conference on E-health Networking, Application & Services, 2015.
[18] H. Fang, C. Johnson, C. Stopp, and K. A. Espy, “A new look at quantifying tobacco exposure during pregnancy using fuzzy clustering,” Neurotoxicology and Teratology, vol. 33, no. 1, pp. 155–165, 2011.
[19] H. Fang, V. Dukic, K. E. Pickett, L. Wakschlag, and K. A. Espy, “Detecting graded exposure effects: A report on an east boston pregnancy cohort,” Nicotine & Tobacco Research, p. ntr272, 2012.
[20] Z. Zhang and H. Fang, “Multiple imputation based clustering validation (miv) for big longitudinal trial data with missing values in ehealth,” Journal of Medical Systems, 2016.
[21] H. Fang, K. A. Espy, M. L. Rizzo, C. Stopp, S. A. Wiebe, and W. W. Stroup, “Pattern recognition of longitudinal trial data with nonignorable missingness: An empirical case study,” International Journal of Information Technology & Decision Making, vol. 8, no. 03, pp. 491–513, 2009.
[22] K. A. Espy, H. Fang, D. Charak, N. Minich, and H. G. Taylor, “Growth mixture modeling of academic achievement in children of varying birth weight risk,” Neuropsychology, vol. 23, no. 4, p. 460, 2009.
[23] J. Friedman and J. Tukey, “A projection pursuit algorithm for exploratory data analysis,” IEEE Transactions on Computers, vol. C-23, no. 9, pp. 881–890, Sept 1974.
[24] J. B. Kruskal, “Toward a practical method which helps uncover the structure of a set of multivariate observations by finding the linear transformation which optimizes a new index of condensation,” in Statistical Computation. Academic Press, New York, 1969, pp. 427–440.
[25] E. R.-M., J. Y. Goulermas, T. Mu, and J. F. Ralph, “Automatic induction of projection pursuit indices,” Neural Networks, IEEE Transactions on, vol. 21, no. 8, pp. 1281–1295, 2010.
[26] M. C. Jones and R. Sibson, “What is projection pursuit?” Journal of the Royal Statistical Society. Series A (General), pp. 1–37, 1987.
[27] G. Eslava and F. H. C. Marriott, “Some criteria for projection pursuit,” Statistics and Computing, vol. 4, no. 1, pp. 13–20, 1994.
[28] J. Dauxois, A. Pousse, and Y. Romain, “Asymptotic theory for the principal component analysis of a vector random function: some applications to statistical inference,” Journal of Multivariate Analysis, vol. 12, no. 1, pp. 136–154, 1982.
[29] J. A. Rice and B. W. Silverman, “Estimating the mean and covariance structure nonparametrically when the data are curves,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 233–243, 1991.
[30] S. Pezzulli and B. Silverman, “Some properties of smoothed principal components analysis for functional data,” Computational Statistics, vol. 8, pp. 1–1, 1993.
[31] B. W. Silverman et al., “Smoothed functional principal components analysis by choice of norm,” The Annals of Statistics, vol. 24, no. 1, pp. 1–24, 1996.
[32] G. Boente and R. Fraiman, “Kernel-based functional principal components,” Statistics & Probability Letters, vol. 48, no. 4, pp. 335–345, 2000.
[33] J. O. Ramsay, Functional Data Analysis. Wiley Online Library, 2006.
[34] P. Hall and M. H.-N., “On properties of functional principal components analysis,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 109–126, 2006.
[35] F. Yao and T. Lee, “Penalized spline models for functional principal component analysis,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 3–25, 2006.
[36] D. Gervini, “Free-knot spline smoothing for functional data,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 4, pp. 671–687, 2006.
[37] N. Locantore, J. Marron, D. Simpson, N. Tripoli, J. Zhang, K. Cohen, G. Boente, R. Fraiman, B. Brumback, C. Croux et al., “Robust principal component analysis for functional data,” Test, vol. 8, no. 1, pp. 1–73, 1999.
[38] R. J. Hyndman, S. Ullah et al., “Robust forecasting of mortality and fertility rates: a functional data approach,” Computational Statistics & Data Analysis, vol. 51, no. 10, pp. 4942–4956, 2007.
[39] D. Gervini, “Robust functional estimation using the median and spherical principal components,” Biometrika, vol. 95, no. 3, pp. 587–600, 2008.
[40] H. Fang, V. Dukic, K. E. Pickett, L. Wakschlag, and K. A. Espy, “Detecting graded exposure effects: A report on an east boston pregnancy cohort,” Nicotine & Tobacco Research, p. ntr272, 2012.
[41] H. Fang, C. Johnson, C. Stopp, and K. A. Espy, “A new look at quantifying tobacco exposure during pregnancy using fuzzy clustering,” Neurotoxicology and Teratology, vol. 33, no. 1, pp. 155–165, 2011.
[42] H. Fang, M. L. Rizzo, H. Wang, K. A. Espy, and Z. Wang, “A new nonlinear classifier with a penalized signed fuzzy measure using effective genetic algorithm,” Pattern Recognition, vol. 43, no. 4, pp. 1393–1401, 2010.
(Figure: six 2-D scatter panels of EPP projections, one per data size: 1800, 3600, 5400, 9000, 18000, and 36000.)

Fig. 6: EPP for simulated longitudinal data using QuitPrimo parameters, with data size from 1800 to 36000

(Figure: two line plots of the optimal number of iterations under ε = 10^-3, 10^-4, 10^-5, and 10^-6: panel (a) versus sample size N, panel (b) versus dimension d.)

Fig. 7: The optimal number of iterations for EPP at different sample sizes or dimensions for simulated data. (a) Simulated data, data size from 2000 to 100000; (b) simulated data, data size fixed at 100,000.
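The artificial data behind Figures 7 and 8 follow the design described in the text: three clusters with mean vectors of 0.2, 0.5, and 0.8, unit diagonal in the correlation matrix, and off-diagonal correlations drawn from {0.1, 0.3, 0.5}. The NumPy sketch below is our own minimal illustration of such a generator; the paper's exact covariance construction and longitudinal structure may differ, and the positive-definiteness repair step is an added assumption of ours.

```python
import numpy as np

def simulate_clusters(n_per_cluster, d, seed=0):
    """Draw three Gaussian clusters with mean vectors 0.2, 0.5, and
    0.8, unit variances, and off-diagonal correlations sampled from
    {0.1, 0.3, 0.5}, loosely following the simulation design in the
    text (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Symmetric correlation matrix: 1s on the diagonal, random
    # off-diagonal entries from {0.1, 0.3, 0.5}.
    R = np.eye(d)
    iu = np.triu_indices(d, k=1)
    off = rng.choice([0.1, 0.3, 0.5], size=len(iu[0]))
    R[iu] = off
    R[(iu[1], iu[0])] = off
    # A randomly filled correlation matrix need not be positive
    # definite; shift-and-rescale repairs it while keeping the
    # unit diagonal (an assumption added for this sketch).
    w_min = np.linalg.eigvalsh(R).min()
    if w_min < 1e-6:
        c = 1e-6 - w_min
        R = (R + c * np.eye(d)) / (1.0 + c)
    # Stack one multivariate-normal sample per cluster mean.
    X = np.vstack([
        rng.multivariate_normal(np.full(d, mu), R, size=n_per_cluster)
        for mu in (0.2, 0.5, 0.8)
    ])
    labels = np.repeat([0, 1, 2], n_per_cluster)
    return X, labels
```

Sweeping `n_per_cluster` and `d` over the reported grid (total sizes 1,000 to 500,000; dimensions 10 to 100) would reproduce the settings summarized in Figures 7 and 8.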

[43] H. Fang, S. DiFranza, Z. Zhang, D. Ziedonis, and J. Allison, “Pattern recognition approach to culturally-tailored behavioral interventions for smoking cessation: Dose and timing,” in Society for Research on Nicotine and Tobacco, 2014.
[44] H. Fang, K. A. Espy, M. L. Rizzo, C. Stopp, S. A. Wiebe, and W. W. Stroup, “Pattern recognition of longitudinal trial data with nonignorable missingness: An empirical case study,” International Journal of Information Technology & Decision Making, vol. 8, no. 03, pp. 491–513, 2009.
[45] H. Fang, J. Allison, B. Barton, Z. Zhang, G. Olendzki, and Y. Ma, “Pattern recognition approach for behavioral interventions: An application to a dietary trial,” in Society of Behavioral Medicine. Ann Behav Med., 2014.
[46] D. F. Andrews, “Plots of high-dimensional data,” Biometrics, pp. 125–136, 1972.
[47] R. Khattree and D. N. Naik, “Andrews plots for multivariate data: some new suggestions and applications,” Journal of Statistical Planning and Inference, vol. 100, no. 2, pp. 411–425, 2002.
[48] D. Asimov, “The grand tour: a tool for viewing multidimensional data,” SIAM Journal on Scientific and Statistical Computing, vol. 6, no. 1, pp. 128–143, 1985.
[49] A. Buja and D. Asimov, “Grand tour methods: an outline,” Computing Science and Statistics, vol. 17, pp. 63–67, 1986.
[50] A. Buja, C. Hurley, and J. McDonald, “A data viewer for multivariate data,” in Computer Science and Statistics: Proceedings of the 18th Symposium on the Interface, 1987, pp. 171–174.
[51] D. Cook, A. Buja, and J. Cabrera, “Direction and motion control in the grand tour,” in Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, 1991, pp. 180–183.
[52] D. Cook, A. Buja, and C. Hurley, “Grand tour and projection pursuit (a video),” ASA Statistical Graphics Video Lending Library, 1993.
[53] D. Cook, A. Buja, J. Cabrera, and C. Hurley, “Grand tour and projection pursuit,” Journal of Computational and Graphical Statistics, vol. 4, no. 3, pp. 155–172, 1995.
[54] D. Cook and A. Buja, “Manual controls for high-dimensional data projections,” Journal of Computational and Graphical Statistics, vol. 6, no. 4, pp. 464–480, 1997.
(Figure: 3-D surface plots of the optimal number of iterations (0–500) over sample size N and dimension d, one surface per ε in {10^-3, 10^-4, 10^-5, 10^-6}.)

Fig. 8: The optimal number of iterations for the EPP algorithm for the artificial longitudinal data with varied sample sizes, dimensions and error bounds (ε) for the change between iterations

[55] G. W. Furnas and A. Buja, “Prosection views: Dimensional inference through sections and projections,” Journal of Computational and Graphical Statistics, vol. 3, no. 4, pp. 323–353, 1994.
[56] C. Hurley and A. Buja, “Analyzing high-dimensional data with motion graphics,” SIAM Journal on Scientific and Statistical Computing, vol. 11, no. 6, pp. 1193–1211, 1990.
[57] E. Wegman and J. Shen, “Three-dimensional andrews plots and the grand tour,” Computing Science and Statistics, pp. 284–284, 1993.
[58] A. Dayanik, “Feature interval learning algorithms for classification,” Knowledge-Based Systems, vol. 23, no. 5, pp. 402–417, 2010.
[59] J. Hu, W. Deng, J. Guo, and W. Xu, “Learning a locality discriminating projection for classification,” Knowledge-Based Systems, vol. 22, no. 8, pp. 562–568, 2009.
[60] J. W. Sammon, “A nonlinear mapping for data structure analysis,” IEEE Transactions on Computers, vol. 18, no. 5, pp. 401–409, 1969.
[61] J. Mao and A. K. Jain, “Artificial neural networks for feature extraction and multivariate data projection,” Neural Networks, IEEE Transactions on, vol. 6, no. 2, pp. 296–317, 1995.
[62] R. C. T. Lee, J. R. Slagle, and H. Blum, “A triangulation method for the sequential mapping of points from n-space to two-space,” Computers, IEEE Transactions on, vol. 100, no. 3, pp. 288–292, 1977.
[63] E. Pekalska, D. de Ridder, R. P. Duin, and M. A. Kraaijveld, “A new method of generalizing sammon mapping with application to algorithm speed-up,” in ASCI, vol. 99, 1999, pp. 221–228.
[64] J. B. Tenenbaum, V. De Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, 2000.
[65] L. Yang, “Sammon’s nonlinear mapping using geodesic distances,” in Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, vol. 2. IEEE, 2004, pp. 303–306.
[66] A. Asuncion and D. Newman, “Uci machine learning repository,” 2007.
[67] S. S. Kim, S.-H. Kim, H. Fang, S. Kwon, D. Shelley, and D. Ziedonis, “A culturally adapted smoking cessation intervention for korean americans: A mediating effect of perceived family norm toward quitting,” Journal of Immigrant and Minority Health, pp. 1–10, 2014.
[68] T. K. Houston, R. S. Sadasivam, D. E. Ford, J. Richman, M. N. Ray, and J. J. Allison, “The quit-primo provider-patient internet-delivered smoking cessation referral intervention: a cluster-randomized comparative effectiveness trial: study protocol,” Implement Sci, vol. 5, p. 87, 2010.
[69] T. K. Houston, R. S. Sadasivam, J. J. Allison, A. S. Ash, M. N. Ray, T. M. English, T. P. Hogan, and D. E. Ford, “Evaluating the quit-primo clinical practice eportal to increase smoker engagement with online cessation interventions: a national hybrid type 2 implementation study,” Implementation Science, vol. 10, no. 1, p. 154, 2015.
[70] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees. CRC Press, 1984.
[71] Z. Zhang and H. Fang, “Multiple-vs non-or single-imputation based fuzzy clustering for incomplete longitudinal behavioral intervention data,” in 2016 IEEE First International Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE), June 2016, pp. 219–228.
[72] Z. Zhang, H. Fang, and H. Wang, “Multiple imputation based clustering validation (miv) for big longitudinal trial data with missing values in ehealth,” J. Med. Syst., vol. 40, no. 6, pp. 1–9, Jun. 2016. [Online]. Available: http://dx.doi.org/10.1007/s10916-016-0499-0
[73] ——, “A new mi-based visualization aided validation index for mining big longitudinal web trial data,” IEEE Access, vol. 4, pp. 2272–2280, 2016.
[74] H. Fang, G. P. Brooks, M. L. Rizzo, K. A. Espy, and R. S. Barcikowski, “Power of models in longitudinal study: Findings from a full-crossed simulation design,” The Journal of Experimental Education, vol. 77, no. 3, pp. 215–254, 2009.
[75] H. Fang, “hlmdata and hlmrmpower: Traditional repeated measures vs. hlm for multilevel longitudinal data analysis-power and type i error rate comparison,” in Proceedings of the Thirty-First Annual SAS Users Group Conference, SAS Institute Inc., Cary, NC, 2006.
[76] Y. Ma, B. Olendzki, J. Wang, G. Persuitte, W. Li, H. Fang, P. Merriam, N. Wedick, I. Ockene, A. Culver, K. Schneider, G. Olendzki, Z. Zhang, T. Ge, J. Carmody, and S. Pagoto, “Randomized trial of single- versus multi-component dietary goals on weight loss and diet quality in individuals with metabolic syndrome,” in Annals of Internal Medicine, 2014.

Hua Fang is an Associate Professor in the Department of Computer and Information Science, Department of Mathematics, University of Massachusetts Dartmouth, 285 Old Westport Rd, Dartmouth, MA, 02747; Department of Quantitative Health Sciences, University of Massachusetts Medical School, Worcester, MA, 01605; Division of Biostatistics and Health Services Research, Department of Quantitative Health Sciences, University of Massachusetts Medical School. Dr. Fang’s research interests include computational statistics, research design, statistical modeling and analyses in clinical and translational research. She is interested in developing novel methods and applying emerging robust techniques to enable and improve the studies that can have broad impact on the treatment or prevention of human disease.

Zhaoyang Zhang received the B.S. degree in science and the M.S. degree in electrical engineering from Xidian University, Xian, China, in 2007 and 2010, respectively. He is currently pursuing his Ph.D. degree in the College of Engineering, University of Massachusetts, Dartmouth, MA, USA. His current research interests include wireless healthcare, wireless body area networks, big data and cyber-physical systems.