
Graduation thesis: Nguyễn Việt Cường

ACKNOWLEDGEMENTS

I would like to express my respect and deep gratitude to my teachers, Dr. Hà Quang Thụy of the College of Technology, Vietnam National University, Hanoi, and Dr. Đoàn Sơn of Tohoku University, Japan, who guided and encouraged me greatly throughout the work on this thesis.
I would also like to thank the teachers of the College of Technology, Vietnam National University, Hanoi, and the seminar group of the Department of Information Systems, who taught, helped and advised me throughout my studies.
Finally, I am grateful to my family, who gave me life, raised me and have encouraged me greatly over this time.
Hanoi, 20/05/2006
Student

Nguyễn Việt Cường
ABSTRACT
Text representation is one of the most important stages of text processing and the first to receive attention. It strongly affects text retrieval, classification, clustering and summarization. This thesis presents and studies a new text representation method based on fuzzy set theory and applies it to text classification. The thesis focuses on the following:
1. Presenting several common text representation methods. In particular, the thesis examines the vector model in depth, in which each document is represented as a vector whose components are keywords that are present in or absent from the document. The thesis then examines how text is represented in search engines.
2. Presenting fuzzy set theory and describing a new text representation based on fuzzy concepts, and from that proposing an approach to handling synonyms that appear in text.
3. Experimenting with the new representation on text classification, reporting classification results and comparing them with the conventional vector model, and drawing conclusions and directions for further development.

TABLE OF CONTENTS
ACKNOWLEDGEMENTS
ABSTRACT
TABLE OF CONTENTS
INTRODUCTION
Chapter 1. TEXT DATA MINING
1.1. Overview of data mining
1.1.1. Concept
1.1.2. Steps of the data mining process
1.1.3. Applications of data mining
1.2. Some problems in text data mining
1.2.1. Text retrieval
1.2.2. Text classification
Chapter 2. BASIC TEXT REPRESENTATION METHODS
2.1. Text preprocessing
2.2. The logic model
2.3. The syntactic analysis model
2.4. The vector space model
2.4.1. The Boolean model
2.4.2. The frequency model
2.5. Text representation in search engines
2.5.1. Introduction to search engines
2.5.2. The text representation model in search engines
Chapter 3. TEXT REPRESENTATION USING FUZZY CONCEPTS
3.1. Fuzzy theory
3.1.1. Fuzzy sets
3.1.2. Operations on fuzzy sets
3.1.3. Fuzzy relations
3.1.4. Operations on fuzzy relations
3.2. Text representation using fuzzy concepts
3.2.1. Fuzzy concepts
3.2.2. Text representation
3.2.3. A proposed solution to the synonym problem
Chapter 4. TEXT CLASSIFICATION METHODS
4.1. Overview of the classification problem
4.2. Classification algorithms
4.2.1. Classification based on the Naive Bayes algorithm
4.2.2. Classification based on the K-Nearest Neighbor (KNN) algorithm
4.2.3. Classification based on the decision tree algorithm
4.2.4. Classification using Support Vector Machines (SVM)
Chapter 5. SOME EXPERIMENTAL RESULTS
5.1. Data set and preprocessing
5.2. Tools and classification method
5.3. Experimental results
CONCLUSIONS AND FUTURE WORK
REFERENCES



INTRODUCTION
Today, the rapid growth of the Internet has led to an explosion of information in many respects, in both content and volume. With one simple search operation we can retrieve an enormous number of web pages containing information related to what we are looking for. This very ease, however, makes it hard to filter out the useful information and obtain new knowledge. Knowledge discovery and data mining are the most recent answer to this problem: they aim to discover new knowledge in the huge volume of data that people have accumulated.
Among data types, text is the one people encounter most often. The prevailing text representation model is the vector space model, in which each document is represented by a vector of keywords. Text mining, however, often faces difficulties such as the high dimensionality of text and the ambiguity of language. In this thesis we describe a new text representation: representation based on fuzzy concepts. Each concept is defined by a set of related keywords, and the degree to which a concept is relevant to a document is determined by a fuzzy aggregation function over those keywords. Once we have a set of concepts related to one or more topics to be classified, each document is viewed as a vector whose components are those fuzzy concepts.
Given the amount of digital text on the Internet, a major requirement is to organize and search information as effectively as possible. Classifying information is one reasonable solution to this requirement. The thesis presents several typical classification algorithms and sets out an experimental approach for the representation based on fuzzy concepts.
We apply the KNN (k-nearest neighbors) algorithm and the WEKA software to perform classification. The experiments show that the representation based on fuzzy concepts gives better classification results than the keyword-vector representation.
Apart from the introduction and the conclusion, the thesis is presented in 5 chapters:
Chapter 1 gives an overview of text data mining, some definitions and some typical problems.
Chapter 2 presents several traditional text representation methods: the frequency model, the syntactic analysis model, the vector space model, and so on. It also describes the text representation commonly used in search engines.
Chapter 3 gives an overview of fuzzy set theory [9][14] and some operations on fuzzy sets. The main content of the chapter is a new text representation based on fuzzy concepts.
Chapter 4 presents the text classification problem and some typical classification algorithms.
Chapter 5 reports the experimental results obtained when the new representation model is applied to text classification, and evaluates and compares them with the conventional representation model.














Chapter 1. TEXT DATA MINING
1.1. Overview of data mining
1.1.1. Concept
Data mining [1][7][13] is a concept that emerged in the late 1980s. It covers a range of techniques for discovering valuable information hidden in large data sets such as data warehouses and very large databases. In essence, data mining involves analyzing data and using techniques to find systematic patterns in the data set.
Some representative definitions of data mining:
"The nontrivial extraction of implicit, previously unknown, and potentially useful information from data" [13].
"The science of extracting useful information from large data sets or databases" [1].
In 1989, Fayyad, Piatetsky-Shapiro and Smyth introduced the term Knowledge Discovery in Databases (KDD) to denote the whole process of discovering useful knowledge from large data sets [6]. Within that process, data mining is a particularly important step, which uses specialized algorithms to extract patterns from data.
1.1.2. Steps of the data mining process
Data mining algorithms are usually described as programs that operate directly on a data file. With earlier machine learning and statistical methods, the first step of an algorithm is usually to load the whole data set into main memory for processing. For industrial applications that mine large data warehouses this model is no longer adequate, not only because the data cannot all be loaded into memory but also because it cannot be extracted into simple files for analysis.
The data mining process starts by defining precisely the problem to be solved. Next, the relevant data used to build the solution are identified. The following step is to collect the relevant data and process them into a format that data mining algorithms can understand. In theory this looks simple, but in practice it is a very difficult process with many obstacles: data must be copied many times (if extracted into files), sets of data files must be managed, and the whole process must be repeated many times (if the data model changes).
It would be too cumbersome for a data mining algorithm to access the entire contents of the database and do all of the above. Moreover, this is not necessary. Many data mining algorithms work on fairly simple summary statistics of the database, since the full information in the database is far more than the mining task needs.
The next step is to choose a suitable data mining algorithm and run it to find meaningful patterns, in a representation corresponding to those meanings. Patterns are usually represented as classification rules, decision trees, production rules, regression expressions, and so on.

Figure 1: The data mining process
A pattern must be new, at least to the system. Novelty can be measured with respect to changes in the data (by comparing current values with previous or expected values) or with respect to knowledge (how the newly found method relates to the old one). Novelty is usually assessed by logical functions or by functions that measure how new or unexpected the pattern is. In addition, a pattern must be potentially usable. Once processed and interpreted, patterns must lead to useful actions, as assessed by a utility function. For loan data, for example, the utility function evaluates the potential increase in profit from the loans. A mined pattern must also remain valid on new data with some degree of accuracy.
Because data mining algorithms and tasks vary widely, the forms of the extracted patterns are also diverse. In the simplest case, the analysis yields a report of some kind, which may include statistical measures of model fit, outliers, and so on. In practice the output is much more complex. An extracted pattern may be a description of a trend, possibly in text form, a graph describing the relationships in the model, or an action, for example a user request about what to do with what has been mined from the database.
Data mining can thus be seen as the inheritance, combination and extension of basic techniques studied earlier, such as machine learning, pattern recognition, statistics (regression, classification, clustering), graphical models, Bayesian networks, artificial intelligence, and knowledge acquisition for expert systems. Through its goal-directed combination of these techniques, however, data mining has an advantage over the earlier methods and offers many prospects for scientific research as well as for increasing profit in business.
1.1.3. Applications of data mining
Although it is a new approach, data mining has attracted great interest from researchers and developers thanks to its practical applications [xx]. Some typical applications are:
Data analysis and decision support
Medical treatment
Text mining and Web mining
Bioinformatics
Finance and the stock market
The next part gives an overview of text mining, one of the typical applications of data mining listed above.
1.2. Some problems in text data mining
1.2.1. Text retrieval
Description:
Text retrieval [2][10] is the process of finding documents according to a user's request. Requests are expressed as queries, the simplest form of which is a set of keywords. One can picture a text retrieval system as sorting the documents in the search domain into two classes: one that is displayed, containing the documents that satisfy the user's query, and one that is not displayed, containing the documents that do not. In practice, typical retrieval systems today, such as the search engines Google and AltaVista, do not work this way; they return a list of documents ordered by relevance to the user's query.
The retrieval process
The retrieval process consists of four main sub-processes:
Indexing: Raw documents must be converted into some representation for processing. This step is also called text representation; the representation must be structured and easy to process. An important part of this thesis is the study of how to represent text using fuzzy set theory in order to obtain a representation that carries more semantics.
Query formulation: Users must describe their information needs as queries. Queries must be expressed in a form common to retrieval systems, such as entering the keywords to search for. There are also methods for formulating queries in natural language or by example; these require more complex processing. The great majority of current retrieval systems use keyword queries.
Comparison: The system must compare the user's query explicitly and completely with the documents stored in the database. The system then classifies the documents by their relevance to the user's query and sorts them in decreasing order of relevance. It displays either whole documents or only part of each.
Feedback: In many cases the results returned at first do not satisfy the user, so a feedback process is needed in which the user modifies or re-enters the request. The user can also interact with the system about the documents that do satisfy the request, and the system updates those documents accordingly. This process is called relevance feedback.
Current search tools concentrate mainly on the first three sub-processes; most have no feedback process and no handling of interaction between user and machine. Feedback is now being widely studied, and within human-machine interface interaction a research direction called interface agents has emerged.
1.2.2. Text classification
Description
Text classification [3][5][8][11][12] is the process of assigning documents to one or more predefined classes. Documents can be classified manually, that is, by reading each document and assigning it to a class. A document management system contains a great many documents, so this approach takes a great deal of time and effort and is not feasible. Automatic classification methods are therefore needed. Automatic classification uses machine learning methods from artificial intelligence such as decision trees, Bayes, and k-nearest neighbors.
One of the most important applications of automatic text classification is in text retrieval systems. Starting from a subset of documents that have already been classified, every document in the search domain is assigned a class index. In a query, the user can then specify the topic or class of documents wanted so that the system returns exactly what is requested.
Another application of text classification is in text understanding. Classification can be used to filter documents, or parts of documents, that contain the data sought without losing the complexity of natural language.
In text classification, the correspondence between a document and a class is expressed either by a true/false value (True: the document belongs to the class; False: it does not) or by a degree of membership (a measure of how strongly the document belongs to the class). When there are many classes, true/false classification amounts to deciding whether a document belongs to exactly one particular class.
The classification process
Text classification follows these steps:
Indexing: Indexing documents is the same as indexing in text retrieval. Here the speed of indexing plays an important role, because a considerable number of new documents may need to be indexed in real time.
Determining the classifier: As in text retrieval, text classification requires a way of expressing how to decide which class a document belongs to (the classification model), based on its representation. In a text classification system this is called the classifier (or categorizer). It plays the role that queries play in a retrieval system. However, whereas queries are transient, a classifier is used stably and over a long period.
Comparison: In most classifiers, each document must be assigned true or false for some class. The biggest difference from comparison in text retrieval is that each document is compared with a number of classes only once, and the choice of a suitable decision also depends on the relationships between the classes.
Feedback (or adaptation): Feedback plays an important role in a text classification system. First, classification requires a large number of documents that were classified by hand beforehand; these are used as training samples to build the classifier. Second, in text classification it is not easy to change requirements as in the feedback step of text retrieval, because users can only inform the system maintainer about deleting, adding or changing the classes they want.
Besides the two common problems above, there are others:
Text clustering: grouping documents with similar content.
Text summarization: summarizing the content of a given document.
Text routing: sending a given document to a particular topic or storage location according to the user's requirements.
In all of these problems, a document is usually represented as a set of features that characterize it. All subsequent processing operates on these features. There are many criteria for selecting the features, but all rely on automatic processing of keywords.
The next chapter presents several traditional text representation methods.
Chapter 2. BASIC TEXT REPRESENTATION METHODS
2.1. Text preprocessing
Before text representation begins, the text is preprocessed. This step is very important because it reduces the number of words in the representation and thereby the size of the data.
Preprocessing consists of:
Lexical analysis
Lexical analysis identifies the words in a document. Its result is a set of distinct words. In many cases, however, certain special tokens need separate treatment, such as digits, brackets, punctuation marks, and upper versus lower case. For example, digits are usually discarded during analysis because on their own they carry no meaning for the document (apart from a few special cases, such as information gathering in the field of history). Punctuation marks such as ".", "!", "?" and "-" are also usually removed without affecting the content of the document. Care is needed in a few cases, however: in hyphenated compounds such as "state-of-the-art" the "-" must not be removed, because doing so would change the meaning of the word.
Stop word removal
Stop words are words that appear too often across the documents of the whole collection and usually do not help distinguish the content of documents. For example, words such as "web", "site", "link" and "www" [??], which appear in almost all documents, are called stop words. In addition, English has many words that serve only to express sentence structure rather than content, such as "a" and "the" (articles), "in" (preposition), "but" (conjunction), and the common verbs "to" and "be"; some special adverbs and adjectives are also treated as stop words.
Because of this property, stop words can be removed without affecting the later stages of text representation.
[Table: a list of some English stop words]

Some stop words in Vietnamese: và; hoặc; cũng; là; mỗi; bởi
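The lexical analysis and stop word steps above can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the stop word set `STOP_WORDS` is a small hypothetical list, and the tokenization rule (letters only, internal hyphens kept) is an assumption chosen to match the "state-of-the-art" remark.

```python
import re

# Hypothetical stop word list for illustration; a real system would load a full list.
STOP_WORDS = {"a", "the", "in", "but", "to", "be", "is", "and", "of"}

def tokenize(text):
    """Lexical analysis: lowercase, drop digits and punctuation, keep hyphenated words whole."""
    # A token is a run of letters, optionally joined by internal hyphens,
    # so "state-of-the-art" survives as a single token.
    return re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word set."""
    return [t for t in tokens if t not in stop_words]

tokens = tokenize("In 1949, the state-of-the-art method was simple!")
print(tokens)
print(remove_stop_words(tokens))
```

The number "1949" and the punctuation are discarded during tokenization, while the hyphenated compound is kept intact before stop words are filtered out.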
Removing low-frequency words
Observation of documents shows that many words in the original collection occur very rarely and have very little influence on the documents. Words with low frequency therefore need to be removed. The method used was proposed by Zipf in 1949 and is based on observing the frequency of words in the collection.
Let $f_t$ be the frequency of keyword $t$ in the collection $X$. Sort all keywords in the collection in decreasing order of frequency, and let $r_t$ be the rank of keyword $t$. Zipf's law is stated as:
$$f_t \cdot r_t \approx K$$
where $K$ is a constant. If $N$ is the total number of words in the collection, it is observed that $K \approx N/10$.
The frequency and the rank of a keyword are thus inversely proportional. To make this clearer, Zipf's law can be rewritten as:
$$r_t \approx \frac{K}{f_t}$$
and plotted as a diagram (Figure 2).
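As a rough illustration of the rank-frequency relationship, the sketch below computes $f_t$, $r_t$ and the product $f_t \cdot r_t$ for a toy token list, and drops rare keywords. The threshold `min_freq` and the toy corpus are assumptions for illustration only; on such a tiny sample the product is not expected to be constant.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (word, f_t, r_t, f_t * r_t) tuples sorted by decreasing frequency.

    Ties are broken alphabetically, so tied words still receive distinct ranks.
    """
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(w, f, r, f * r) for r, (w, f) in enumerate(ordered, start=1)]

def drop_rare(tokens, min_freq=2):
    """Remove keywords whose collection frequency is below min_freq (an assumed threshold)."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_freq]

corpus = "data mining finds patterns in data and text mining finds patterns in text data".split()
for row in rank_frequency(corpus):
    print(row)
print(drop_rare(corpus))
```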


Removing prefixes and suffixes
Removing prefixes and suffixes (stemming) reduces a word to its root. In practice a root can have many derived forms, such as verb, noun, adjective and adverb forms, which are semantically related. For example, the words "clusters", "clustering" and "clustered" are all related to the word "cluster". Stemming therefore reduces the number of words without affecting the content of the document.
Stemming has a shortcoming, however: because the algorithm uses a set of simple rules to strip prefixes and suffixes, it can produce incorrect words. For example, "computing" and "computation" both become "comput" after stemming, whereas the correct word is "compute".
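The behaviour described above, including the "comput" problem, can be reproduced with a deliberately naive suffix stripper. The rule list `SUFFIXES` is hypothetical and far smaller than what a real stemmer uses; the sketch only shows why simple rules can yield non-words.

```python
# Hypothetical suffix rules, tried in order; real stemmers use many more rules.
SUFFIXES = ("ation", "ing", "ed", "es", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["clusters", "clustering", "clustered", "computing", "computation"]:
    print(w, "->", naive_stem(w))
```

The three "cluster" variants collapse to the same root, while "computing" and "computation" collapse to "comput", which is not a word.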
Figure 2. Filtering words according to Zipf's law
2.2. The logic model
In this model, the meaningful words in a document are indexed, and the content of documents is managed through the index. Each document is indexed by listing the meaningful words in it together with the positions at which they occur. A meaningful word is one that carries the main information about the stored documents; seeing it, one can tell the topic of the document being represented.
The input documents are indexed against the list of keywords above. For each keyword, the positions at which it occurs are numbered and stored together with the identifier of the document containing it. This representation is also used by search engines.
For example, take two documents with identifiers VB1 and VB2:
"Cộng hòa xã hội chủ nghĩa Việt Nam" (VB1)
"Việt Nam dân chủ cộng hòa" (VB2)
They are then represented as follows:
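One way to picture such an index for VB1 and VB2 is the sketch below. It assumes whitespace tokenization into lowercase syllables and positions counted from 1; the thesis may segment the text into multi-syllable words instead, so the exact entries are illustrative only.

```python
from collections import defaultdict

docs = {
    "VB1": "Cộng hòa xã hội chủ nghĩa Việt Nam",
    "VB2": "Việt Nam dân chủ cộng hòa",
}

def build_positional_index(documents):
    """Map each term to a list of (document id, position) pairs, positions starting at 1."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split(), start=1):
            index[term].append((doc_id, position))
    return dict(index)

index = build_positional_index(docs)
for term in sorted(index):
    print(term, index[term])
```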



With this representation, search works as follows:
The query is given in logical form, that is, as a set of operations (AND, OR, ...) applied to words or phrases. The search uses the index table that has been built, and the result is the set of documents that satisfy all of the conditions.
Advantages and disadvantages:
Advantages
Search is fast and simple.
Suppose we search for the word "computer". The system scans the index table and, if "computer" exists in the system, goes to the corresponding index entry. This lookup is fast and simple when the index table has been sorted alphabetically beforehand. Sorting a table of n words costs on the order of $n \log_2 n$ operations, after which each lookup by binary search costs on the order of $\log_2 n$. The index entry found tells us which documents contain the keyword. A query involving k words therefore needs on the order of $k \log_2 n$ operations (n being the number of words in the index table).
Flexible queries
Users can use special characters in a query without affecting the complexity of the search. For example, searching for "ta%" returns the documents containing "ta", "tao", "tay", ..., that is, words beginning with "ta". The character "%" is called a wildcard character.
In addition, logical operators allow the search words to be combined flexibly into queries. For example, in a search for [tôi, ta, tao], the brackets [] stand for "or", meaning a search on any one of the words in the group. This is simply a flexible way of writing the OR operation of Boolean algebra, instead of writing: find the documents containing "tôi" or "ta" or "tao".
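A small sketch of lookups over an alphabetically sorted index, covering exact search, the "ta%" style prefix query, and the bracketed OR group. The index contents and document numbers are invented for illustration, and the function names are hypothetical.

```python
from bisect import bisect_left

# Hypothetical index: term -> set of document ids.
index = {
    "ta": {1, 3},
    "tao": {2},
    "tay": {3, 4},
    "tôi": {5},
    "xe": {1, 5},
}
terms = sorted(index)  # sorted once, so each lookup can use binary search

def lookup(term):
    """Exact lookup by binary search over the sorted term list."""
    i = bisect_left(terms, term)
    return index[terms[i]] if i < len(terms) and terms[i] == term else set()

def prefix_search(prefix):
    """Wildcard query 'prefix%': union of the postings of all terms starting with prefix."""
    i = bisect_left(terms, prefix)  # terms sharing a prefix are contiguous in sorted order
    result = set()
    while i < len(terms) and terms[i].startswith(prefix):
        result |= index[terms[i]]
        i += 1
    return result

def any_of(group):
    """Query [w1, w2, ...]: OR over the words in the group."""
    result = set()
    for word in group:
        result |= lookup(word)
    return result

print(prefix_search("ta"))
print(any_of(["tôi", "ta", "tao"]))
```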
Disadvantages
The searcher needs experience and expertise in searching, because queries are given in logical form and the results are likewise logical (Boolean): a document is returned only when it satisfies every condition. To find documents by content, the user must therefore know precisely what the documents contain.
Indexing documents is complex and time-consuming, and the index tables also take up storage space.
The documents found are not ranked by how well they match the query. The index tables are inflexible, because when the vocabulary changes (words are added, deleted, ...) the index must change accordingly.
2.3. The syntactic analysis model
In this model, every document is parsed to obtain detailed information about its topic. The topics of each document are then indexed. Indexing topics works like indexing documents, except that only the words appearing in the topics are indexed.
Documents are managed through these topics so that they can be retrieved on request; queries are matched against the topics.
Search method:
Search is performed on the indexed topics. A query can itself be parsed to yield a topic, and the search is then carried out on that topic.
The main processing component of a database system built on this model is therefore the syntactic analysis system that infers the content of documents.
Advantages and disadvantages of this method
Advantages
Search with this method is effective and simple, and therefore fast and accurate.
For languages with simple grammar, the analysis can reach a high and acceptable level of accuracy.
Disadvantages
The quality of a system built this way depends entirely on the quality of the syntactic analysis that infers document content. In practice, building such a system is very complex, depends on the characteristics of each language, and in most cases has not yet reached high accuracy.
2.4. The vector space model
The most common way to represent text is as a vector in the vector space model. This representation is relatively simple and effective.
In this model, each document is represented as a vector. Each component of the vector corresponds to a distinct keyword in the original collection and is assigned a value by a function f that indicates the density of occurrences of the keyword in the document.

Figure 3: Document vectors in a 2-dimensional space

Suppose a document is represented by the vector $V(v_1, v_2, \ldots, v_n)$, where $v_i$ is the number of occurrences of the $i$-th keyword in the document. Consider the following two documents:
VB1: Life is not only life
VB2: To life is to fight
After preprocessing, they are represented as follows:
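As a rough illustration of the count-vector representation for VB1 and VB2, the sketch below builds the vocabulary and the occurrence counts. It assumes only lowercasing and whitespace tokenization, with no stop word removal, so the vocabulary may differ from the one obtained with the thesis's own preprocessing.

```python
from collections import Counter

docs = {"VB1": "Life is not only life", "VB2": "To life is to fight"}

# Assumed preprocessing: lowercase and split on whitespace only.
tokenized = {d: text.lower().split() for d, text in docs.items()}
vocabulary = sorted({t for tokens in tokenized.values() for t in tokens})

def count_vector(tokens):
    """Vector whose i-th component is the number of occurrences of the i-th keyword."""
    counts = Counter(tokens)
    return [counts[t] for t in vocabulary]

vectors = {d: count_vector(tokens) for d, tokens in tokenized.items()}
print(vocabulary)
for d, v in vectors.items():
    print(d, v)
```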

In text databases, the vector model is the most widely used text representation today. Relationships between documents are computed on their vector representations and can therefore be evaluated efficiently. In particular, many studies of the "similarity" relationship between web pages (one of the most typical relationships between web pages) are based on the vector representation.
2.4.1. The Boolean model
A vector model in which the function f takes only two discrete values, true and false (or 1 and 0), is called the Boolean model. The function f for keyword $t_i$ returns true if and only if $t_i$ appears in the document.
The Boolean model is defined as follows:
Suppose the database contains m documents, $D = \{d_1, d_2, \ldots, d_m\}$. Each document is represented as a vector over n keywords $T = \{t_1, t_2, \ldots, t_n\}$. Let $W = \{w_{ij}\}$ be the weight matrix, where $w_{ij}$ is the weight of keyword $t_i$ in document $d_j$:
$$w_{ij} = \begin{cases} 1 & \text{if } t_i \text{ occurs in } d_j \\ 0 & \text{otherwise} \end{cases}$$
Returning to the two documents above, the Boolean model gives the following representation:
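A minimal sketch of the Boolean weight matrix for the two example documents, with rows indexed by keyword $t_i$ and columns by document $d_j$. As before, the tokenization (lowercase, whitespace split, no stop word removal) is an assumption, so the rows may not match the thesis's own table.

```python
docs = {"d1": "Life is not only life", "d2": "To life is to fight"}

# Assumed preprocessing: lowercase and split on whitespace only.
token_sets = {d: set(text.lower().split()) for d, text in docs.items()}
terms = sorted(set().union(*token_sets.values()))

# W[i][j] = 1 if keyword terms[i] occurs in the j-th document, else 0.
W = [[1 if t in token_sets[d] else 0 for d in docs] for t in terms]

for t, row in zip(terms, W):
    print(t, row)
```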


2.4.2. The frequency model
In the frequency model, the matrix $W = \{w_{ij}\}$ is determined from the frequency of keyword $t_i$ in document $d_j$ or from the frequency of $t_i$ in the whole database. Some common methods follow.
a. Term frequency (TF)
The values $w_{ij}$ are computed from the number of occurrences of the keyword in the document. Let $f_{ij}$ be the number of occurrences of keyword $t_i$ in document $d_j$; then $w_{ij}$ is given by one of three formulas:
$$w_{ij} = f_{ij}$$
$$w_{ij} = 1 + \log(f_{ij})$$
$$w_{ij} = \sqrt{f_{ij}}$$
In this method, the weight $w_{ij}$ increases with the number of occurrences of keyword $t_i$ in document $d_j$. The more often $t_i$ occurs in $d_j$, the more $d_j$ depends on $t_i$; in other words, $t_i$ carries much of the information in $d_j$.
For example, if the keyword "máy tính" (computer) occurs many times in a document, the document is mainly about computing.
This reasoning does not always hold, however. A typical example is the word "và" (and), which occurs often in almost every document but in fact carries far less meaning than its frequency suggests. Also, a word may be absent from one document while appearing in others, in which case $\log(f_{ij})$ cannot be computed. Another method was introduced to overcome these weaknesses of TF: the IDF method.
b. Inverse document frequency (IDF)
In this method, $w_{ij}$ is computed as:
$$w_{ij} = \begin{cases} \log\dfrac{m}{h_i} = \log(m) - \log(h_i) & \text{if } t_i \text{ occurs in } d_j \\ 0 & \text{otherwise} \end{cases}$$
where m is the number of documents and $h_i$ is the number of documents in which keyword $t_i$ occurs.
The weight $w_{ij}$ here reflects the importance of keyword $t_i$ in document $d_j$. The fewer documents $t_i$ appears in, the larger its weight for a document $d_j$ in which it does appear; it is then an important feature for distinguishing $d_j$ from other documents, and it carries more information.
c. The TF-IDF method
This method combines TF and IDF; the weight matrix is computed as:
$$w_{ij} = \begin{cases} [1 + \log(f_{ij})] \log\dfrac{m}{h_i} & \text{if } f_{ij} \ge 1 \\ 0 & \text{otherwise} \end{cases}$$
This method combines the advantages of both methods above. The weight $w_{ij}$ is computed from the frequency of keyword $t_i$ in document $d_j$ and the rarity of $t_i$ in the whole database.
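The three weighting schemes can be sketched as below. The text does not state the base of the logarithm, so the natural logarithm is assumed here; the three toy documents and the helper names are illustrative, not taken from the thesis.

```python
import math

def tf_weight(f_ij):
    """TF variant w_ij = 1 + log(f_ij), taken as 0 when the keyword is absent."""
    return 1 + math.log(f_ij) if f_ij > 0 else 0.0

def idf_weight(m, h_i):
    """IDF log(m / h_i): m documents in total, h_i of them contain the keyword."""
    return math.log(m / h_i) if h_i > 0 else 0.0

def tfidf_matrix(token_lists):
    """Return (terms, W) with W[i][j] = [1 + log(f_ij)] * log(m / h_i), or 0 if f_ij = 0."""
    m = len(token_lists)
    terms = sorted({t for tokens in token_lists for t in tokens})
    W = []
    for t in terms:
        freqs = [tokens.count(t) for tokens in token_lists]  # f_ij for each document j
        h = sum(1 for f in freqs if f > 0)                   # h_i
        W.append([tf_weight(f) * idf_weight(m, h) for f in freqs])
    return terms, W

token_lists = [
    "life is not only life".split(),
    "to life is to fight".split(),
    "fuzzy sets model vague concepts".split(),
]
terms, W = tfidf_matrix(token_lists)
for t, row in zip(terms, W):
    print(t, [round(w, 3) for w in row])
```

A keyword that occurs in every document would get weight 0 under this scheme, since $\log(m / h_i) = 0$ when $h_i = m$.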

Search method:
A query is mapped to a vector $Q(q_1, q_2, \ldots, q_n)$ whose coefficients differ from word to word: the more relevant a word is to the content sought, the larger its coefficient.
$q_i = 0$ when the word is not in the list of words being searched for.
$q_i \neq 0$ when the word is in the list of words being searched for.
Given a vocabulary, we can thus determine the vector for each document, and each query likewise has a corresponding vector with predetermined coefficients. Search and management are carried out on these vector representations.
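The text does not say which measure compares the query vector with the document vectors. A common choice is cosine similarity, which is assumed in the sketch below; the vocabulary order and the example vectors are likewise assumptions carried over from a whitespace tokenization of the two example documents.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query_vec, doc_vecs):
    """Return document ids sorted by decreasing similarity to the query."""
    scores = {d: cosine(query_vec, v) for d, v in doc_vecs.items()}
    return sorted(scores, key=lambda d: (-scores[d], d))

# Assumed vocabulary order: fight, is, life, not, only, to
doc_vecs = {"VB1": [0, 1, 2, 1, 1, 0], "VB2": [1, 1, 1, 0, 0, 2]}
query = [1, 0, 0, 0, 0, 1]  # q_i != 0 only for "fight" and "to"
print(rank(query, doc_vecs))
```

Because every document receives a numeric score, the results can be ordered by relevance, which is the first advantage listed for this model.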

Advantages and disadvantages of this representation:
Advantages:
The documents returned can be ranked by their relevance to the query, because the computation yields for each document a score measuring its relevance to the content sought.
Formulating queries is easy and does not require the searcher to have deep expertise in the subject.
Storage and search are simpler than in the logic model.

Disadvantages
Search is slow when the vocabulary is large, because computation is performed over all the document vectors.
Representing vectors with natural-number coefficients increases the accuracy of search but greatly reduces computation speed, because vector multiplications must be performed on natural or real numbers; moreover, storing the vectors is costly and complex.
The system is inflexible in how it stores keywords. A very small change in the vocabulary table forces either re-vectorizing all stored documents or ignoring the added meaningful words in documents that were encoded earlier.
A further disadvantage is that the dimension of each vector is very large, since it equals the number of distinct words in the collection. For example, the number of words may range from $10^3$ to $10^5$ for small collections, and is larger still for large collections, especially on the Web.
2.5. Text representation in search engines
2.5.1. Introduction to search engines
Information on the Web is diverse in both content and form. This diversity and sheer volume, however, give rise to the problem of information overload. For any one user only a very small part of the information is useful; for example, some people care only about sports and culture pages and have little interest in economics. People cannot find by themselves the addresses of the web pages containing the information they need, so a utility is required that manages the content of web pages and finds the addresses of pages whose content matches the searcher's request. Some well-known utilities of this kind are Yahoo, Google and AltaVista.
A search engine is a system built to receive users' search requests (usually a set of keywords), analyze them, search an existing database, and return web pages as results. Specifically, the user submits a query, in its simplest form a list of keywords, and the search engine returns a list of web pages that are relevant to or contain those keywords. In more complex cases the query is a whole document, a passage, or a summary of a document.
2.5.2. M hnh biu din vn bn trong my tm kim
Nh c gii thiu, m hnh vector l m hnh biu din ph bin nht trong
cc CSDL vn bn. Tuy nhin, cn c mt cc biu din khc cng thng c s dng,
c bit trong cc my tm kim, biu din theo m hnh index ngc (inverted index).
Vi mt t kho trong cu hi ca ngi dng, thng qua m hnh index ngc h thng
c s d liu vn bn s nhanh chng xc nh c tp hp cc vn bn cha t kha
v cc v tr xut hin ca t kha trong cc vn bn kt qu.
dng n gin nht, m hnh index ngc c dng nh c m t nh hnh
sau:

Hnh 4. M hnh index ngc
In this model there is a set V (called the dictionary) containing all keywords in the system; the keywords in V are stored in an inverted index list. Each keyword $v_i$ in V is associated with a pointer $b(v_i)$ to a data structure, called a bucket, which is a list of all records describing the documents that contain $v_i$ and the positions at which $v_i$ occurs in each document (Figure 4). Several efficient organizations of the dictionary allow V to be kept in main memory; for example, V is often organized as a hash table to speed up access. If the buckets were also kept in main memory, modifying them would be very easy. This is not feasible, however, because the buckets are usually much larger than main memory. The buckets (as well as the document contents) are therefore stored on hard disk. For a text database to manage a large number of documents, specialized algorithms are needed to keep operations on the on-disk buckets fast.
With the inverted index model, finding the documents that contain a given keyword $v_i$ is quite simple. The system first accesses the inverted index to obtain $b(v_i)$ and then traverses the list referenced by $b(v_i)$ to retrieve the documents. When the query is a complex expression in which several keywords are connected by logical operators such as AND, OR, NOT, the search is more involved. For a query with k keywords, the algorithm retrieves the documents for each keyword (using the single-keyword search described above) and obtains k lists of documents. The answer to the query is obtained by combining these k lists according to the given logical expression.
In all cases, with the inverted index representation, retrieving the documents that answer a keyword query is very fast.
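The lookup and the Boolean combination described above can be sketched as follows. This is a minimal in-memory illustration, not the on-disk organization discussed in the text; the function names and the three sample documents are assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword v to its bucket b(v): {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def lookup(index, keyword):
    """Documents whose bucket contains the keyword (single-keyword search)."""
    return set(index.get(keyword, {}))

def query_and(index, keywords):
    """AND query: intersect the k document lists."""
    sets = [lookup(index, k) for k in keywords]
    return set.intersection(*sets) if sets else set()

def query_or(index, keywords):
    """OR query: union of the k document lists."""
    sets = [lookup(index, k) for k in keywords]
    return set.union(*sets) if sets else set()

def query_not(index, keyword, all_docs):
    """NOT query: every document except those containing the keyword."""
    return set(all_docs) - lookup(index, keyword)

# Hypothetical sample collection
docs = {
    "d1": "fuzzy sets extend classical sets",
    "d2": "search engines use an inverted index",
    "d3": "fuzzy search in an index",
}
index = build_inverted_index(docs)
```

Here `query_and(index, ["fuzzy", "index"])` combines two document lists by intersection, which is the k-list combination step for an AND expression.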
Chapter 3. DOCUMENT REPRESENTATION USING FUZZY CONCEPTS
In this chapter we present some basic notions of fuzzy sets, then define fuzzy concepts and some of their properties by aggregating keywords and the relationships among them. On that basis we introduce the method of representing documents by fuzzy concepts.
3.1. Fuzzy theory
It can be said that, to date, most achievements of science rest on rigorous logical reasoning whose foundation is Boolean algebra. In Boolean algebra every operand and expression takes only the value 0 (false) or 1 (true). In practice, however, this does not always hold: many natural and social phenomena cannot be described so crisply. To reflect the true nature of real-world objects and phenomena, Boolean algebra has to be extended so that operands and expressions can take not only the values 0 or 1 but any value between 0 and 1.
To build fuzzy theory in a natural way, one must start from the most primitive notions. Just as the set is one of the primitive notions of mathematics, fuzzy theory starts from the construction of the fuzzy set.
3.1.1. Fuzzy sets
In classical mathematics the notion of a set can be stated as follows:
Let X be a set and $A \subseteq X$. We can then construct a function, called the characteristic function, that identifies the elements of A within X:
Consider $\mu: X \to \{0,1\}$ such that, for $x \in X$:
$\mu(x) = 1$ if $x \in A$;
$\mu(x) = 0$ if $x \notin A$.
The characteristic function $\mu(x)$ clearly determines the elements of A: A consists of exactly those elements x for which $\mu(x) = 1$. The set A can therefore be expressed through the elements of X:
$A = \{x \in X \mid \mu(x) = 1\}$
Extending this classical notion of a set, Lotfi Zadeh considered functions $\mu$ that take values in the whole interval [0,1].
Definition 3.1: Fuzzy set
Let X be a set. A is called a fuzzy set in X if $A = \{(x, \mu_A(x)) \mid x \in X\}$, where $\mu_A$ is a function taking values in the interval [0,1], $\mu_A: X \to [0,1]$. The function $\mu_A$ is called the membership function of A, and $\mu_A(x)$, a value in [0,1], is called the membership degree of x in A.
Representing a fuzzy set
When X is a set of discrete points $x_1, x_2, \ldots, x_n$, the fuzzy set A can be written by enumeration:
$A = \{(x_1, \mu_A(x_1)), (x_2, \mu_A(x_2)), \ldots, (x_n, \mu_A(x_n))\}$
or with the notation:
$A = \mu_A(x_1)/x_1 + \mu_A(x_2)/x_2 + \cdots + \mu_A(x_n)/x_n$
When X is continuous, A is written as:
$A = \int_{x \in X} \mu_A(x)/x$
Example:
Let X be the set of grade point averages of students. A survey shows that:
0% of people consider a student excellent when the average is below 7.0;
5% of people consider a student excellent when the average is between 7.0 and 7.5;
10% of people consider a student excellent when the average reaches 8.0;
20% of people consider a student excellent when the average reaches 8.5;
80% of people consider a student excellent only when the average is between 9 and 9.5;
100% of people consider a student excellent when the average reaches 10.
We now need to represent the set of scores in X, denoted A, that describes an "excellent student". With survey results like these, the classical notion of a set cannot represent A. Here the notion of a fuzzy set is very useful, and A is exactly a fuzzy set. If X contains only finitely many values, X = {7, 7.5, 8.0, 8.5, 9.0, 9.5, 10.0}, the fuzzy set A is represented as:
A = {(7, 0.05), (7.5, 0.05), (8.0, 0.1), (8.5, 0.2), (9.0, 0.8), (9.5, 0.8), (10, 1.0)}
or:
A = 0.05/7 + 0.05/7.5 + 0.1/8 + 0.2/8.5 + 0.8/9 + 0.8/9.5 + 1.0/10
If X is the continuous interval X = [7.0, 10], the membership function of A can be plotted as follows:

Figure 5: Graph of the membership function of the fuzzy set A
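3.1.2. Operations on fuzzy sets
Intersection of two fuzzy sets.
Let X be a set and A, B two fuzzy sets in X with membership functions $\mu_A$, $\mu_B$ respectively. The intersection of A and B, denoted $A \cap B$, is the fuzzy set whose membership function $\mu_{A \cap B}$ is defined by:
$\mu_{A \cap B}(x) = \min(\mu_A(x), \mu_B(x)) \quad \forall x \in X$
Union of two fuzzy sets.
Let X be a set and A, B two fuzzy sets in X with membership functions $\mu_A$, $\mu_B$ respectively. The union of A and B in X, denoted $A \cup B$, is the fuzzy set whose membership function $\mu_{A \cup B}$ is defined by:
$\mu_{A \cup B}(x) = \max(\mu_A(x), \mu_B(x)) \quad \forall x \in X$
Algebraic product of two fuzzy sets.
Let X be a set and A, B two fuzzy sets in X with membership functions $\mu_A(x)$, $\mu_B(x)$ respectively. The algebraic product of A and B in X, denoted $A \cdot B$, is the fuzzy set whose membership function is defined by:
$\mu_{A \cdot B}(x) = \mu_A(x) \cdot \mu_B(x) \quad \forall x \in X$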
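Algebraic sum of two fuzzy sets.
Let X be a set and A, B two fuzzy sets in X with membership functions $\mu_A$, $\mu_B$ respectively. The algebraic sum of A and B in X, denoted $A + B$, is the fuzzy set whose membership function is defined by:
$\mu_{A+B}(x) = \mu_A(x) + \mu_B(x) - \mu_A(x) \cdot \mu_B(x) \quad \forall x \in X$
Complement of a fuzzy set.
Let A be a fuzzy set in X with membership function $\mu_A$. The complement $\bar{A}$ of A in X is the fuzzy set whose membership function is defined by:
$\mu_{\bar{A}}(x) = 1 - \mu_A(x) \quad \forall x \in X$
Disjunctive sum of two fuzzy sets.
Let X be a set and A, B two fuzzy sets in X. The disjunctive sum of A and B in X, denoted $A \oplus B$, is defined by:
$A \oplus B = (\bar{A} \cap B) \cup (A \cap \bar{B})$
Difference of two fuzzy sets.
Let X be a set and A, B two fuzzy sets in X with membership functions $\mu_A$, $\mu_B$ respectively. The difference of A and B in X, denoted $A \setminus B$, is defined by:
$A \setminus B = A \cap \bar{B}$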
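Let X be a set and A, B two fuzzy sets in X with membership functions $\mu_A$, $\mu_B$ respectively. A is said to be contained in B, denoted $A \subseteq B$, if the membership functions satisfy:
$\mu_A(x) \le \mu_B(x) \quad \forall x \in X$
Let X be a set and A, B two fuzzy sets in X with membership functions $\mu_A$, $\mu_B$ respectively. A is said to be equal to B, denoted $A = B$, if and only if:
$\mu_A(x) = \mu_B(x) \quad \forall x \in X$
The $\alpha$-level set of a fuzzy set.
Let $\alpha \in [0,1]$, let X be a set, and let A be a fuzzy set in X with membership function $\mu_A$. The set $A_\alpha = \{x \in X \mid \mu_A(x) \ge \alpha\}$ is called the $\alpha$-level set of the fuzzy set A.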
Euclidean distance on fuzzy sets
Let X be a finite set with n elements and A, B two fuzzy sets on X. The Euclidean distance (in n-dimensional space) between fuzzy sets is computed as:
$e(A,B) = \sqrt{\sum_{i=1}^{n} (\mu_A(x_i) - \mu_B(x_i))^2}$
The distance $e^2(A,B)$ is called a Euclidean norm.
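The operations of Section 3.1.2 translate directly into code when a discrete fuzzy set is stored as a mapping from elements to membership degrees. The sketch below is illustrative only; it assumes both sets are defined over the same finite universe, and the function names and the sets A and B are made up for the example.

```python
import math

def intersection(a, b):
    """mu_{A∩B}(x) = min(mu_A(x), mu_B(x))"""
    return {x: min(a[x], b[x]) for x in a}

def union(a, b):
    """mu_{A∪B}(x) = max(mu_A(x), mu_B(x))"""
    return {x: max(a[x], b[x]) for x in a}

def algebraic_product(a, b):
    return {x: a[x] * b[x] for x in a}

def algebraic_sum(a, b):
    return {x: a[x] + b[x] - a[x] * b[x] for x in a}

def complement(a):
    return {x: 1.0 - a[x] for x in a}

def difference(a, b):
    """A \\ B = A ∩ complement(B)"""
    return intersection(a, complement(b))

def alpha_cut(a, alpha):
    """Crisp set of elements whose membership degree is at least alpha."""
    return {x for x, m in a.items() if m >= alpha}

def euclidean(a, b):
    return math.sqrt(sum((a[x] - b[x]) ** 2 for x in a))

# Hypothetical fuzzy sets over X = {x1, x2, x3}
A = {"x1": 0.2, "x2": 0.7, "x3": 1.0}
B = {"x1": 0.5, "x2": 0.3, "x3": 0.0}
```

With these values, `alpha_cut(A, 0.5)` keeps x2 and x3, and `euclidean(A, B)` equals the square root of 0.09 + 0.16 + 1.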
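3.1.3. Fuzzy relations
Definition 3.2: Fuzzy relation on a Cartesian product
Let X, Y be two sets with $x \in X$, $y \in Y$. Let (x,y) denote an ordered pair in the Cartesian product $X \times Y$. The fuzzy set $R = \{((x,y), \mu_R(x,y)) \mid (x,y) \in X \times Y\}$ is called a fuzzy relation on $X \times Y$ with membership function:
$\mu_R: X \times Y \to [0,1]$
If R is a fuzzy set in $X = X_1 \times X_2 \times \cdots \times X_n$, then R is called an n-ary fuzzy relation.
Definition 3.3: Fuzzy relation on fuzzy sets
Let X, Y be two sets with $x \in X$, $y \in Y$, and let A, B be fuzzy sets in X and Y respectively. Let (x,y) denote an ordered pair in the Cartesian product $X \times Y$. $R = \{((x,y), \mu_R(x,y)) \mid (x,y) \in X \times Y\}$ is called a fuzzy relation on the fuzzy sets A, B if:
$\mu_R(x,y) \le \mu_A(x)$ and $\mu_R(x,y) \le \mu_B(y) \quad \forall (x,y) \in X \times Y$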
3.1.4. Operations on fuzzy relations
Besides the operations that carry over from fuzzy sets on a Cartesian product (union, intersection, algebraic sum, algebraic product), several further operations are defined on fuzzy relations:
Max-min composition
Let $R_1$ be a fuzzy relation in $X \times Y$ and $R_2$ a fuzzy relation in $Y \times Z$. The max-min composition of $R_1$ and $R_2$ (denoted $R_1 \circ R_2$) is the fuzzy relation in $X \times Z$ satisfying:
$\mu_{R_1 \circ R_2}(x,z) = \max_{y \in Y} \min(\mu_{R_1}(x,y), \mu_{R_2}(y,z)) \quad \forall x \in X, z \in Z$
Max-product composition
Let $R_1$ be a fuzzy relation in $X \times Y$ and $R_2$ a fuzzy relation in $Y \times Z$. The max-product composition of $R_1$ and $R_2$ (denoted $R_1 \cdot R_2$) is the fuzzy relation in $X \times Z$ satisfying:
$\mu_{R_1 \cdot R_2}(x,z) = \max_{y \in Y} (\mu_{R_1}(x,y) \cdot \mu_{R_2}(y,z)) \quad \forall x \in X, z \in Z$
Max-average composition
Let $R_1$ be a fuzzy relation in $X \times Y$ and $R_2$ a fuzzy relation in $Y \times Z$. The max-average composition of $R_1$ and $R_2$ (denoted $R_1 \,av\, R_2$) is the fuzzy relation in $X \times Z$ satisfying:
$\mu_{R_1 av R_2}(x,z) = \max_{y \in Y} \frac{\mu_{R_1}(x,y) + \mu_{R_2}(y,z)}{2} \quad \forall x \in X, z \in Z$
Max-* composition (* is an arbitrary binary operator)
Let $R_1$ be a fuzzy relation in $X \times Y$ and $R_2$ a fuzzy relation in $Y \times Z$. The max-* composition of $R_1$ and $R_2$ (denoted $R_1 * R_2$) is the fuzzy relation in $X \times Z$ satisfying:
$\mu_{R_1 * R_2}(x,z) = \max_{y \in Y} (\mu_{R_1}(x,y) * \mu_{R_2}(y,z)) \quad \forall x \in X, z \in Z$
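Since max-min, max-product and max-average are all instances of max-* composition, one function parameterized by the binary operator covers the four definitions. The sketch below assumes relations stored as dictionaries keyed by pairs, with missing pairs treated as degree 0; the relation values are invented for illustration.

```python
def compose(r1, r2, combine=min):
    """max-* composition of fuzzy relations given as {(x, y): degree}.

    `combine` is the binary operator *: min gives max-min composition,
    multiplication gives max-product, the mean gives max-average.
    Pairs absent from a relation are assumed to have degree 0.
    """
    xs = {x for x, _ in r1}
    ys = {y for _, y in r1} | {y for y, _ in r2}
    zs = {z for _, z in r2}
    return {
        (x, z): max(combine(r1.get((x, y), 0.0), r2.get((y, z), 0.0)) for y in ys)
        for x in xs
        for z in zs
    }

# Hypothetical relations R1 on X x Y and R2 on Y x Z
R1 = {("x1", "y1"): 0.8, ("x1", "y2"): 0.3, ("x2", "y1"): 0.2, ("x2", "y2"): 0.9}
R2 = {("y1", "z1"): 0.5, ("y2", "z1"): 0.7}

max_min = compose(R1, R2, min)
max_product = compose(R1, R2, lambda a, b: a * b)
max_average = compose(R1, R2, lambda a, b: (a + b) / 2)
```

For the pair (x1, z1), max-min takes the larger of min(0.8, 0.5) and min(0.3, 0.7), which is 0.5.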
Fuzzy aggregation functions
Given a collection of fuzzy sets, aggregating their membership functions yields a new fuzzy set; the operator that performs this aggregation is a fuzzy aggregation function.
A fuzzy aggregation function is defined as an n-ary operator:
$F: [0,1]^n \to [0,1]$ satisfying the conditions:
With 0 and 1 as the two extreme points: $F(0,\ldots,0) = 0$ and $F(1,\ldots,1) = 1$, and for every a in [0,1]: $F(a,\ldots,a) = a$
If $a_i' > a_i$ then: $F(a_1,\ldots,a_i',\ldots,a_n) \ge F(a_1,\ldots,a_i,\ldots,a_n)$ (monotonically increasing property of the fuzzy aggregation function)

Some fuzzy aggregation functions:

1. Generalized mean:
$F(x_1,\ldots,x_n) = \left(\frac{1}{n}\sum_{i=1}^{n} x_i^p\right)^{1/p}, \quad p \in \mathbb{R},\ p \ne 0$

2. Arithmetic mean ($p = 1$):
$F(x_1,\ldots,x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i$

3. Geometric mean ($p \to 0$):
$F(x_1,\ldots,x_n) = (x_1 \cdot x_2 \cdots x_n)^{1/n}$

4. Min function ($p \to -\infty$):
$F(x_1,\ldots,x_n) = \min(x_1, x_2, \ldots, x_n)$

5. Max function ($p \to +\infty$):
$F(x_1,\ldots,x_n) = \max(x_1, x_2, \ldots, x_n)$

3.2. Document representation using fuzzy concepts
The usual way to represent documents is the vector space model, in which a document is represented by a vector and each component of the vector corresponds to a keyword. Chapter 2 pointed out several drawbacks of this method: it is costly and complex to store, and the dimension of each vector is very large when documents contain many keywords. In this section we present a representation method that partly overcomes these drawbacks: representing documents by fuzzy concepts.
3.2.1. Fuzzy concepts
Consider a collection of m documents: $D = (d_1, d_2, \ldots, d_m)$.
From it a set of p keywords is determined: $K = (k_1, k_2, \ldots, k_p)$.
A concept can be a keyword in the usual sense that groups together the words related to it. For example, the concept "hospital" may comprise the keywords: doctor, nurse, patient, stethoscope, medicine...
Let C be the set of n concepts related to the documents, written as:
$C = \{c_1, c_2, \ldots, c_n\}$
where $c_i$ is a concept defined by the user. Suppose a concept $c_i$ comprises its related keywords, $c_i = \{k_1, k_2, \ldots, k_p\}$, where the $k_j$ are keywords in the dictionary that are related to $c_i$. In the example above we have "hospital" = {doctor, nurse, patient, stethoscope, medicine}.
Definition 3.3: Fuzzy concept
A fuzzy set corresponding to a concept, whose membership function is determined by the importance of the keywords related to the concept, is called a fuzzy concept, denoted $c^*$. A fuzzy concept can be represented through the keyword set as follows:
$c^* = \{(k_1, \mu_c(k_1)), (k_2, \mu_c(k_2)), \ldots, (k_p, \mu_c(k_p))\}$
where:
$\mu_c(k_i) = 0$ if $k_i$ does not belong to c;
$\mu_c(k_i) = 1$ if $k_i$ belongs entirely to c;
$\mu_c(k_i) \in (0,1)$ if $k_i$ belongs partially to the concept c.

From the fuzzy concept we obtain the following definition:

Definition 3.4: Fuzzy concept aggregation function
A fuzzy concept aggregation function is a function that aggregates the membership functions of fuzzy concepts. It is denoted $F: [0,1]^p \to [0,1]$ and satisfies the properties of a fuzzy aggregation function:
1. $F(\mu_c(k_1), \ldots, \mu_c(k_p)) \in [0,1]$
2. $F(\mu_c(k_1), \ldots, \mu_c'(k_i), \ldots, \mu_c(k_p)) \ge F(\mu_c(k_1), \ldots, \mu_c(k_i), \ldots, \mu_c(k_p))$ whenever $\mu_c'(k_i) > \mu_c(k_i)$, $i = 1, \ldots, p$

Here $\mu_c(k_i)$ expresses the importance of the keyword in the document.
Example:
Suppose we have the keyword set: "hospital", "school", "medicine", "motorbike", "nurse", "patient", "stethoscope", "student", "rose", "telephone", "doctor".
K = {"hospital", "doctor", "school", "medicine", "motorbike", "nurse", "patient", "stethoscope", "student", "rose", "telephone", "doctor"} with relevance to the document given by a corresponding weighting function:
$\mu_K$ = {$\mu$(hospital), $\mu$(doctor), $\mu$(school), $\mu$(medicine), $\mu$(motorbike), $\mu$(nurse), $\mu$(patient), $\mu$(stethoscope), $\mu$(student), $\mu$(rose), $\mu$(telephone), $\mu$(doctor)}
= {0.8, 0.7, 0.1, 0.4, 0.0, 0.3, 0.6, 0.3, 0.0, 0.1, 0.0, 0.2}
We find a cluster of keywords that are related to one another in the document: {hospital, doctor, medicine, patient, stethoscope}
Choosing the keyword "hospital" as the concept, the fuzzy concept c* = "hospital" is represented as:
"hospital" = {(doctor, 0.7), (medicine, 0.4), (patient, 0.6), (stethoscope, 0.3)}
The importance of "hospital" in the document is then determined by the fuzzy concept aggregation function:
$\mu$(hospital) = F($\mu$(doctor), $\mu$(medicine), $\mu$(patient), $\mu$(stethoscope))
If the aggregation function is MAX:
$\mu$(hospital) = MAX(0.7, 0.4, 0.6, 0.3) = 0.7
If the aggregation function is the arithmetic mean:
$\mu$(hospital) = AVEG(0.7, 0.4, 0.6, 0.3) = 0.5
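The aggregation of keyword weights into a concept weight can be checked numerically. The sketch below implements the generalized mean family (arithmetic mean at p = 1, geometric mean as p tends to 0, min and max at minus and plus infinity) and applies it to the "hospital" example; the function name and the handling of the limiting cases are choices made for this illustration.

```python
import math

def generalized_mean(values, p):
    """Generalized mean of membership degrees in [0, 1].

    p = 1 gives the arithmetic mean, p = 0 the geometric mean (limit case),
    p = -inf the minimum and p = +inf the maximum.
    For negative finite p every value must be strictly positive.
    """
    n = len(values)
    if p == math.inf:
        return max(values)
    if p == -math.inf:
        return min(values)
    if p == 0:
        return math.prod(values) ** (1.0 / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)

# Keyword weights of the fuzzy concept "hospital" from the example above
hospital = {"doctor": 0.7, "medicine": 0.4, "patient": 0.6, "stethoscope": 0.3}
weights = list(hospital.values())

mu_max = generalized_mean(weights, math.inf)   # MAX aggregation
mu_avg = generalized_mean(weights, 1)          # arithmetic mean
mu_geo = generalized_mean(weights, 0)          # geometric mean
mu_min = generalized_mean(weights, -math.inf)  # MIN aggregation
```

The four results are ordered min ≤ geometric ≤ arithmetic ≤ max, which illustrates how the choice of aggregation function moves the concept weight between the least and the most important keyword.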
3.2.2. Document representation
With fuzzy concepts defined as above, a document can be represented as a vector whose components are fuzzy concepts rather than keywords.
A document d is then represented in the following form:
$d = \mu(c_1^*)/c_1^* + \mu(c_2^*)/c_2^* + \cdots + \mu(c_n^*)/c_n^*$
where $\mu(c_i^*)$ is the importance of the fuzzy concept $c_i^*$ in the document. $\mu(c_i^*)$ is determined by the fuzzy aggregation function over the set of keywords related to the concept $c_i^*$.
Note that in this representation a single keyword can also be regarded as a fuzzy concept, by identifying the keyword with the concept.
Among the concepts, one whose keywords are more relevant to the document receives a larger weight, and its meaning in the document is correspondingly clearer.
One issue is how to find the set of keywords that represents a fuzzy concept; these keywords must be related to one another and similar in meaning. Developing such an algorithm is still an open problem. There are two main approaches:
1. Determining the concepts from human knowledge: users identify the related keywords according to their own judgment, or choose the keywords that represent the document. This gives fairly accurate results (it has been done in large systems such as Yahoo!), but it takes a great deal of time and effort.
2. Developing automatic algorithms: using natural language processing techniques to identify keywords that are related to one another. Such algorithms are currently an active topic in natural language processing.
The goal of this study is to experiment with the fuzzy-concept representation in the text classification problem. The fuzzy concepts are determined in advance, based on a predefined set of topics. Details of the fuzzy concepts used are given in the experimental section.
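The change from a keyword vector to a fuzzy-concept vector can be sketched as below. The concept definitions and the document weights are hypothetical; MAX is used as the default aggregation function, and a concept with no keyword present in the document receives weight 0.

```python
def concept_vector(keyword_weights, concepts, aggregate=max):
    """Represent a document by fuzzy concepts instead of keywords.

    keyword_weights: {keyword: weight in the document}
    concepts: {concept name: set of related keywords}
    aggregate: fuzzy aggregation function applied to the weights present
    """
    vector = {}
    for name, keywords in concepts.items():
        present = [keyword_weights[k] for k in keywords if k in keyword_weights]
        vector[name] = aggregate(present) if present else 0.0
    return vector

# Hypothetical concepts and one hypothetical document
concepts = {
    "hospital": {"doctor", "nurse", "patient", "stethoscope", "medicine"},
    "school": {"student", "teacher"},
    "vehicle": {"motorbike", "car"},
}
document = {"doctor": 0.7, "patient": 0.6, "medicine": 0.4, "student": 0.2}

vector = concept_vector(document, concepts)
```

The document has four keyword components but only three concept components, which is the dimension reduction the representation aims for.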
3.2.3. A proposed solution to the synonym problem
Documents often contain synonyms (or words that are close in meaning). Their presence makes document representation harder and prevents the dimension of the representation vector from being reduced.
This thesis proposes the following method for finding and handling synonyms in documents:
Finding synonyms
We use the WordNet dictionary. WordNet is a lexical database for English, introduced in 1985 at the Cognitive Science Laboratory of Princeton University.
WordNet:
- Groups English words into sets of synonyms called synsets.
- Provides short, general definitions and records many semantic relations among these synonym sets.
- Distinguishes verbs, nouns, and adjectives, because they follow different grammatical rules.
Purpose: to combine a dictionary with a thesaurus, supporting text analysis and applications in AI.

Example:
Computer: a machine for performing calculations automatically.
syn: data processor, electronic computer, information processing system.
Starting from a keyword in the keyword set and using WordNet, we find the synonyms of that keyword. Intersecting the two sets, the keyword set and the synonym set, yields a group of synonyms within the existing keyword set.

Handling synonyms:
For the set of synonyms just found in the document, we use a fuzzy aggregation function to merge them into one common concept. Document processing then operates on this single concept instead of on the individual keywords. This reduces the dimension of the representation vector, reduces computational complexity, and avoids confusing users who encounter synonyms in the document.
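The two steps above (intersect the keyword set with a synonym set, then aggregate the synonyms into one concept) can be sketched as follows. To keep the example self-contained, a small hand-written table stands in for WordNet; `TOY_SYNSETS`, the keyword weights, and the function names are all assumptions of this illustration, not WordNet's API.

```python
# Hypothetical stand-in for WordNet synsets, keyed by a representative word
TOY_SYNSETS = {
    "computer": {"computer", "data processor", "electronic computer",
                 "information processing system"},
    "car": {"car", "auto", "automobile"},
}

def synonym_groups(keywords, synsets):
    """Intersect each synonym set with the keyword set; keep real groups only."""
    groups = {}
    for head, synonyms in synsets.items():
        found = synonyms & set(keywords)
        if len(found) > 1:
            groups[head] = found
    return groups

def merge_synonyms(weights, synsets, aggregate=max):
    """Replace each group of synonymous keywords by one concept component."""
    merged = dict(weights)
    for head, found in synonym_groups(weights, synsets).items():
        value = aggregate(weights[k] for k in found)
        for k in found:
            merged.pop(k, None)
        merged[head] = value
    return merged

# Hypothetical keyword weights of one document
weights = {"car": 0.4, "automobile": 0.6, "engine": 0.5, "computer": 0.1}
merged = merge_synonyms(weights, TOY_SYNSETS)
```

Only "car" and "automobile" form a group here, so the four keyword components collapse to three.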
Chapter 4. TEXT CLASSIFICATION METHODS

In this chapter we present the text classification problem and the algorithms applied to it.

4.1. Overview of the classification problem
Automatic text classification has attracted a great deal of attention in recent years. Many approaches have been used, such as methods based on keywords, methods based on the semantics of high-frequency words, the Maximum Entropy model, rough sets, fuzzy sets... A large number of classification methods have been applied successfully to this task: regression models, k-nearest neighbors, the probabilistic Naive Bayes method, decision trees, inductive rule learning, neural networks, online learning, and support vector machines (SVM).
Problem: We are given a set of documents that users have classified by hand into a number of predefined topics. When a new document is presented, the system must name the topic that best fits it.
Traditionally, text classification can be done manually, that is, by reading each document and assigning it to the appropriate group. This costs much time and money when the number of documents is large. Tools that classify documents automatically are therefore needed.
To build automatic text classifiers, machine learning algorithms are commonly used. There are, however, more specialized algorithms for classification in particular domains. A typical example is the classification of official documents. The system finds the distinguishing features of a document in a fairly mechanical way: if the left margin carries the code "NĐ", the system assigns the document to the group "Nghị định" (decree); likewise, the codes "CV" and "QĐ" send the document to the groups "Công văn" (official dispatch) and "Quyết định" (decision). This algorithm is fairly effective but suits only data with such specific features. When working with less specific documents, classification algorithms must be based on the content of the documents and compare how well they match documents already classified by humans. This is the main idea of machine learning. In this model the documents are classified in advance, and the system must extract the characteristics of the documents in each group. The set of sample documents used for training is called the training set (or pattern set); the process of classifying documents by hand is called training, and the process by which the machine finds the characteristics of the groups is called learning. After learning, users present new documents, and the machine's task is to find which of the groups it was trained on best fits each document.
In text classification, the membership of a document in a group can be a discrete value, true or false (the document does or does not belong to the group), or a continuous value (a real number indicating how well the document fits the group). When the answer must be true or false, the continuous value can be discretized by thresholding. The threshold depends on the algorithm and on the user's requirements.
4.2. Classification algorithms
Many different algorithms address the text classification problem; in this chapter we present the following:
4.2.1. Classification based on the Naive Bayes algorithm
Naive Bayes is one of the most typical probability-based classification methods in data mining and knowledge discovery.
This approach uses the conditional probabilities between words and topics to predict the topic probabilities of a document to be classified. It also assumes that the occurrences of all words in a document are independent of one another. It therefore does not use combinations of words to predict the topic, as some other methods do, which makes Naive Bayes computation efficient and fast.
Formula
The main goal is to compute the probability $\Pr(C_j, d)$, the probability that document d belongs to class $C_j$. Following Bayes, document d is assigned to the class $C_j$ with the highest probability $\Pr(C_j, d)$.
The following formula is used to compute $\Pr(C_j, d)$:

where:
- $TF(w_i, d)$ is the number of occurrences of word $w_i$ in document d
- $|d|$ is the number of words in document d
- $w_i$ is a word in the feature space F, whose dimension is $|F|$
- $\Pr(C_j)$ is the proportion of documents of each class in the training set:

$\Pr(w_i \mid C_j)$ uses the Laplace estimate:

Naive Bayes is very effective in some cases. If the training data are poor and the estimated parameters (such as the feature space) are of low quality, the results are poor. Nevertheless, it is regarded as a linear classification algorithm well suited to multi-topic text classification, with several advantages: it is simple to implement, fast, easy to update with new training data, and highly independent of the training set, so that several training sets can be combined. In practice an optimal threshold is usually also set in order to obtain good classification results.
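The formulas of this section did not survive in the text, so the sketch below assumes the common multinomial form, in which $\Pr(C_j \mid d) \propto \Pr(C_j) \prod_{w_i \in F} \Pr(w_i \mid C_j)^{TF(w_i, d)}$ and the Laplace estimate is $\Pr(w_i \mid C_j) = (1 + TF(w_i, C_j)) / (|F| + \sum_{w' \in F} TF(w', C_j))$. These expressions, the function names, and the four training documents are assumptions; the thesis may use a different variant.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (list_of_words, class_label). Returns the model parts."""
    class_docs = Counter(label for _, label in docs)
    term_counts = defaultdict(Counter)   # TF(w, C_j) summed over the class
    vocab = set()                        # feature space F
    for words, label in docs:
        term_counts[label].update(words)
        vocab.update(words)
    priors = {c: n / len(docs) for c, n in class_docs.items()}  # Pr(C_j)
    return priors, term_counts, vocab

def log_score(words, c, priors, term_counts, vocab):
    """log Pr(C_j) + sum over word occurrences of log Pr(w_i | C_j)."""
    total = sum(term_counts[c].values())
    score = math.log(priors[c])
    for w in words:                      # repeats account for TF(w_i, d)
        if w in vocab:
            p = (1 + term_counts[c][w]) / (len(vocab) + total)  # Laplace
            score += math.log(p)
    return score

def classify_nb(words, model):
    priors, term_counts, vocab = model
    return max(priors, key=lambda c: log_score(words, c, priors, term_counts, vocab))

# Hypothetical training set
training = [
    ("engine wheel car".split(), "autos"),
    ("car engine oil".split(), "autos"),
    ("game team win".split(), "sport"),
    ("team season cup".split(), "sport"),
]
model = train_nb(training)
```

Logarithms are used so that the product of many small probabilities does not underflow; the class with the highest score is returned.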
4.2.2. Classification based on the K-Nearest Neighbor (KNN) algorithm
The KNN classification algorithm [4] is a traditional and well-known method in the statistical approach; it has been studied in pattern recognition for several decades.
It is regarded as one of the best methods and has been used since the early days of text classification.
To classify a new document, KNN computes the distance (Euclidean, cosine...) from this document to the documents in the training set and selects the k nearest documents, called the k nearest neighbors. The similarity values just computed are used as weights for the topics. The weight of a topic is the sum of the similarities between the document to be classified and those of its k neighbors that share that topic. Topics that do not occur among the k documents have weight 0. The topics are then sorted by decreasing weight, and the topic with the highest weight is chosen as the topic of the document to be classified.
Formula
The weight of topic $c_j$ for document $\vec{x}$:
$W(\vec{x}, c_j) = \sum_{\vec{d_i} \in KNN} sim(\vec{x}, \vec{d_i}) \cdot y(\vec{d_i}, c_j) - b_j$
where:
- $y(\vec{d_i}, c_j) \in \{0,1\}$ with:
y = 0: document $\vec{d_i}$ does not belong to topic $c_j$
y = 1: document $\vec{d_i}$ belongs to topic $c_j$
- $sim(\vec{x}, \vec{d_i})$ is the similarity between the document to be classified $\vec{x}$ and document $\vec{d_i}$. The cosine measure is used to compute $sim(\vec{x}, \vec{d_i})$:
$sim(\vec{x}, \vec{d_i}) = \cos(\vec{x}, \vec{d_i}) = \frac{\vec{x} \cdot \vec{d_i}}{\|\vec{x}\| \, \|\vec{d_i}\|}$
- $b_j$ is the classification threshold of topic $c_j$, learned automatically using a validation set of documents drawn from the training set.
The more documents there are in the neighbor set, the more stable the algorithm and the lower its error.
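The topic-weight formula can be sketched as below, with documents as sparse dictionaries of term weights. The thresholds $b_j$ default to 0 here because the text does not give their values; the training vectors and function names are hypothetical.

```python
import math
from collections import defaultdict

def cosine(x, d):
    """Cosine similarity between two sparse vectors {term: weight}."""
    dot = sum(w * d.get(t, 0.0) for t, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_x * norm_d) if norm_x and norm_d else 0.0

def knn_classify(x, training, k, thresholds=None):
    """training: list of (vector, topic). Returns (best topic, topic scores).

    W(x, c_j) = sum of sim(x, d_i) over the k neighbors labeled c_j, minus b_j.
    Topics absent from the k neighbors implicitly have weight 0.
    """
    thresholds = thresholds or {}        # b_j, assumed 0 when not provided
    neighbors = sorted(training, key=lambda item: cosine(x, item[0]),
                       reverse=True)[:k]
    weights = defaultdict(float)
    for d, topic in neighbors:
        weights[topic] += cosine(x, d)
    scores = {c: w - thresholds.get(c, 0.0) for c, w in weights.items()}
    return max(scores, key=scores.get), scores

# Hypothetical training vectors
training = [
    ({"car": 1.0, "engine": 1.0}, "autos"),
    ({"car": 1.0, "wheel": 1.0}, "autos"),
    ({"team": 1.0, "game": 1.0}, "sport"),
    ({"game": 1.0, "cup": 1.0}, "sport"),
]
label, scores = knn_classify({"car": 2.0, "engine": 1.0, "game": 1.0}, training, k=3)
```

With k = 3 the two "autos" neighbors contribute about 0.87 and 0.58, against about 0.29 for the single "sport" neighbor.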
4.2.3. Classification based on decision trees
The decision tree is one of the most widely used methods for inductive learning from large data sets. It is a method for approximating discrete-valued target functions. One advantage of decision trees is that they can easily be converted into a knowledge base of If-Then rules.
In the tree, the internal nodes are labeled with concepts, the branches leaving a node are labeled with the weights of the corresponding concept in the sample documents, and the leaves are labeled with class labels. Such a system classifies a document $d_i$ by recursively testing the weights that the concepts labeling the internal nodes have in the vector $\vec{d_i}$, until a leaf is reached; the label of that leaf is then assigned to document $d_i$. Most classifiers of this kind use a binary document representation, and the trees are accordingly binary.
Example of a decision tree:

Figure 7. A decision tree
Entropy
Entropy measures the homogeneity, or purity, of a set of samples. It is a very important quantity in information theory.
Entropy is computed as follows:
$Entropy(S) = -\sum_{i=1}^{n} p_i \log_2 p_i$
where $p_i$ is the proportion of samples in S that take the i-th target value.
Information Gain
Information Gain measures the influence of an attribute of the samples on the classification.
The Information Gain of an attribute A on the set S, denoted $Gain(S, A)$, is defined as follows:
$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$
where Values(A) is the set of possible values of attribute A, and $S_v$ is the subset of S whose elements have attribute A = v, that is, $S_v = \{s \in S \mid A(s) = v\}$.
The first term, Entropy(S), is the original entropy of S; the second term is the expected entropy after S is partitioned into subsets according to the values of attribute A. Gain(S,A) is therefore the expected reduction in entropy obtained by knowing the value of attribute A. The value Gain(S,A) is the number of bits saved when encoding the target value of a member of S, given the value of attribute A.
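Both quantities are short to compute. The sketch below uses a toy set of four documents with two binary attributes; the attribute names and labels are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(samples, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    n = len(labels)
    partitions = {}
    for sample, label in zip(samples, labels):
        partitions.setdefault(sample[attribute], []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

# Hypothetical binary document features
samples = [
    {"has_car": 1, "has_game": 0},
    {"has_car": 1, "has_game": 1},
    {"has_car": 0, "has_game": 1},
    {"has_car": 0, "has_game": 1},
]
labels = ["autos", "autos", "sport", "sport"]

gain_car = information_gain(samples, labels, "has_car")
gain_game = information_gain(samples, labels, "has_game")
```

Splitting on `has_car` separates the two classes perfectly, so its gain equals the full entropy of the set (1 bit), while `has_game` leaves one mixed subset and gains less.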
ID3 algorithm
ID3 and its successor, C4.5, are the most popular decision tree learning algorithms.
In general, the steps of the ID3 algorithm can be stated briefly as follows:
- Determine the attribute with the highest Information Gain on the training set.
- Use this attribute as the root of the tree and create one branch for each possible value of the attribute.
- For each branch, repeat the process on the subset of the training set assigned to that branch.
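The three steps above can be sketched as a short recursive function. This is a bare-bones illustration without pruning or handling of attribute values unseen at training time; the nested-dictionary tree format and the toy data are assumptions of the example.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    n = len(labels)
    parts = {}
    for row, label in zip(rows, labels):
        parts.setdefault(row[attr], []).append(label)
    return entropy(labels) - sum(len(p) / n * entropy(p) for p in parts.values())

def id3(rows, labels, attributes):
    """Return a leaf label or {"attribute": a, "branches": {value: subtree}}."""
    if len(set(labels)) == 1:            # pure node: stop
        return labels[0]
    if not attributes:                   # nothing left to split on: majority
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, labels, a))
    rest = [a for a in attributes if a != best]
    tree = {"attribute": best, "branches": {}}
    for value in {row[best] for row in rows}:
        subset = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        tree["branches"][value] = id3([r for r, _ in subset],
                                      [y for _, y in subset], rest)
    return tree

def predict(tree, row):
    """Follow branches until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attribute"]]]
    return tree

# Hypothetical training data
rows = [
    {"has_car": 1, "has_game": 0},
    {"has_car": 1, "has_game": 1},
    {"has_car": 0, "has_game": 1},
    {"has_car": 0, "has_game": 1},
]
labels = ["autos", "autos", "sport", "sport"]
tree = id3(rows, labels, ["has_car", "has_game"])
```

On this data the root is `has_car`, and each of its two branches is already a pure leaf.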
4.2.4. Classification using Support Vector Machines (SVM)
The support vector machine (SVM) [12], introduced by Cortes and Vapnik in 1995, is an effective classification approach for two-class pattern recognition problems, based on the principle of Structural Risk Minimization.
Given a training set represented in a vector space in which each document is a point, the SVM algorithm finds the best decision hyperplane h that separates the points in this space into two classes, labeled + and −. The quality of this separating hyperplane is determined by the distance (called the margin) from the nearest data point of each class to the hyperplane. The larger the margin, the better the decision plane and the more accurate the classification. The aim of the SVM algorithm is to find the largest margin. The following figure illustrates the algorithm:

Figure 8: The hyperplane h separates the training data into the two classes + and − with the largest margin. The points nearest to h (circled) are the support vectors.
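Formula
The equation of the hyperplane containing the vector $\vec{d_i}$ in the space:
$\vec{d_i} \cdot \vec{w} + b = 0$
Let
$h(\vec{d_i}) = sign(\vec{d_i} \cdot \vec{w} + b) = \begin{cases} +1, & \text{if } \vec{d_i} \cdot \vec{w} + b > 0 \\ -1, & \text{if } \vec{d_i} \cdot \vec{w} + b < 0 \end{cases}$
Thus $h(\vec{d_i})$ expresses the assignment of $\vec{d_i}$ to one of the two classes above.
Let $y_i \in \{\pm 1\}$: $y_i = +1$ means that document $\vec{d_i}$ belongs to class +; $y_i = -1$ means that $\vec{d_i}$ belongs to class −. To obtain the hyperplane h we solve the following problem:
Find $\min \|\vec{w}\|$, where $\vec{w}$ and b satisfy the condition:
$y_i(\vec{d_i} \cdot \vec{w} + b) \ge 1, \quad \forall i \in \overline{1,n}$
The Lagrange multipliers can then be used to transform the problem into a form with equality constraints for solving.
In the SVM method the decision plane depends only on the points nearest to it (the support vectors), whose distance to it is $\frac{1}{\|\vec{w}\|}$. Removing the other points does not affect the result.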
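The decision rule and the margin can be checked on a small example. The sketch below does not train an SVM; it only evaluates a given hyperplane, assuming the usual constraint $y_i(\vec{d_i} \cdot \vec{w} + b) \ge 1$. The points, the weight vector w, and the offset b are hypothetical values chosen so that the constraint holds.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decision(d, w, b):
    """h(d) = sign(d.w + b); points exactly on the plane are mapped to -1 here."""
    return 1 if dot(d, w) + b > 0 else -1

def satisfies_constraints(points, labels, w, b):
    """Check y_i * (d_i.w + b) >= 1 for every training point."""
    return all(y * (dot(d, w) + b) >= 1 for d, y in zip(points, labels))

def margin(w):
    """Distance 1/||w|| from the hyperplane to the support vectors."""
    return 1.0 / math.sqrt(dot(w, w))

# Hypothetical two-class data and a hand-picked separating hyperplane
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [+1, +1, -1, -1]
w, b = (0.5, 0.5), -1.0

support_vectors = [d for d, y in zip(points, labels) if y * (dot(d, w) + b) == 1]
```

The points (2, 2) and (0, 0) meet the constraint with equality, so they play the role of support vectors for this hyperplane; the other two points could be removed without changing it.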
Chapter 5. EXPERIMENTAL RESULTS
5.1. Data set and preprocessing
Input data
We ran the experiments on the standard classification data set 20newsgroup, which has 19,997 documents divided into 20 classes. Each class has about 1,000 documents.
Because this data set is very large, we selected only 2 classes for the experiments: rec.sport. and rec.autos.
We ran the experiments with 50, 100, and 500 documents in the 2 classes, with a ratio of 2/1 between the training set and the test set.
Representing documents by keywords
For the data under test, we first carried out a preprocessing step: removing stop words. The stop word removal program was written in C/C++. We then built the keyword set of the data set.
After these steps the documents were represented as vectors of keywords, in the following format:
Name_VB(id_1,TS_1; id_2,TS_2; ...; id_n,TS_n), where Name_VB is the name of the document, id_i is the index of the i-th keyword in the keyword set above, and TS_i is the weight of the i-th keyword. Keyword weights are computed with the TF.IDF formula (Chapter 2). This program was also written in C/C++.
Representing documents by fuzzy concepts
From the keyword set built above we identified clusters of keywords that are related to one another, and from them determined the fuzzy concept representing each cluster.
Some examples of clusters of related keywords and their fuzzy concepts are as follows:
sport = {hockey, baseball, sport, winner, finals, won, game, teams, played,
season, cup, stars, fans, newspaper, begin, goaltender, league, day, distribution, pic,
predictions, champions, scorer, power, driver, time, finished, worried, public, matter,
blow, car, miles}
wood = {hiller, wood}
academic = {university, academic, computer, science, school, laboratory,
college, staff, chemistry, technology, operating}
humour = {jokes, humour, coyote, average}
cool = {cool, air}
speed = {speed, coordination, skating, hilarious, observation, advised, ranked}
raise = {breakthrough, raise}
After determining the fuzzy concepts and combining them with the keyword set of each document, we represented the documents by these fuzzy concepts. From the weights of the keywords in a document, using the MAX fuzzy aggregation function, we determined the relevance of each fuzzy concept to each document. If none of the keywords of a fuzzy concept occurs in the document under consideration, the relevance of that concept to the document is 0.

5.2. Classification tool and method
Classification tool:
We used the Weka software as the classification tool. Some information about Weka follows:
- Weka is open-source data mining software developed by the University of Waikato, New Zealand. "Weka" stands for Waikato Environment for Knowledge Analysis. Weka can be used at several levels. At the simplest level, its algorithms can be applied to data from the command line. It includes many tools for transforming data, such as algorithms for discretizing data.
- Weka provides all the basic functions of data mining, including classification algorithms (classifier), data preprocessing algorithms (filter), clustering algorithms (cluster), and association rule algorithms (association rule).
- The program that converts the keyword representation and the fuzzy-concept representation into Weka's standard data format was written in Java. The documents previously represented as vectors of keyword weights are the input. Running the program produces the output file in Weka's standard data format.
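The conversion step can be sketched as below, assuming that "Weka's standard data format" refers to ARFF, Weka's native text format. The thesis program was written in Java and its exact output is not shown, so the attribute names, class names, and layout here are illustrative only.

```python
def to_arff(relation, attributes, classes, rows):
    """Serialize concept-weight vectors as a dense ARFF file body.

    attributes: ordered list of numeric attribute names (no spaces)
    classes: list of class labels for the nominal class attribute
    rows: list of (vector, label), vector being {attribute: weight}
    """
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} numeric" for name in attributes]
    lines.append("@attribute class {" + ",".join(classes) + "}")
    lines += ["", "@data"]
    for vector, label in rows:
        values = [str(vector.get(name, 0.0)) for name in attributes]
        lines.append(",".join(values + [label]))
    return "\n".join(lines) + "\n"

# Hypothetical fuzzy-concept vectors for two documents
attributes = ["sport", "academic", "speed"]
rows = [
    ({"sport": 0.8, "speed": 0.3}, "sport"),
    ({"academic": 0.2}, "autos"),
]
arff_text = to_arff("fuzzy_concepts", attributes, ["autos", "sport"], rows)
```

Concepts absent from a document are written as 0.0, matching the rule that a concept with no keyword in the document has relevance 0.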
Classification method:
Chapter 4 presented several classification algorithms. From that analysis we observed that K-Nearest Neighbor is a simple method and is regarded as one of the good and effective ones. We therefore chose the K-Nearest Neighbor algorithm for these experiments.
5.3. Experimental results
Two quantities are commonly used to measure the performance of text classification: precision and recall. The F1 measure is also reported. F1 balances precision and recall: it is large when precision and recall are both large and balanced, and small when they are small or unbalanced. The goal is for F1 to be as high as possible.
For a document class C, precision, recall, and F1 are defined as follows:
precision = num_of_match / num_of_model
recall = num_of_match / num_of_manual
F1 = 2 * precision * recall / (precision + recall)
where:
num_of_match: the number of documents labeled correctly
num_of_model: the number of documents that the model labels as class C
num_of_manual: the number of documents that belong to class C
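The three measures can be computed directly from a confusion matrix. The sketch below assumes rows are true classes and columns are predicted classes, as in Weka's output, and uses the matrix reported for the 50-document keyword run in this section; the function name is an assumption.

```python
def class_metrics(confusion, classes, target):
    """Precision, recall and F1 of one class.

    confusion[i][j] = number of documents of true class i labeled as class j.
    """
    i = classes.index(target)
    num_of_match = confusion[i][i]
    num_of_model = sum(row[i] for row in confusion)   # labeled C by the model
    num_of_manual = sum(confusion[i])                 # truly in class C
    precision = num_of_match / num_of_model if num_of_model else 0.0
    recall = num_of_match / num_of_manual if num_of_manual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Confusion matrix of the 50-document run with the keyword representation
classes = ["rec.autos_25", "rec.sport_25"]
confusion = [[2, 6],
             [0, 8]]

autos = class_metrics(confusion, classes, "rec.autos_25")
sport = class_metrics(confusion, classes, "rec.sport_25")
accuracy = sum(confusion[i][i] for i in range(2)) / sum(map(sum, confusion))
```

The same function applied to the other confusion matrices in this section gives their precision, recall, and F1 columns.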
Classification with the 50-document data set (training set: 34, test set: 16).
After trying several values of the parameter k, we found that k = 2 gives the best classification result.
Representing documents by the keyword set:

The output is as follows:
Correctly Classified Instances:      10    62.5%
Incorrectly Classified Instances:     6    37.5%
----------------------
Precision   Recall   F1      Class
1           0.25     0.4     rec.autos_25
0.571       1        0.72    rec.sport_25
----------------------
a    b    <---classified as
2    6    | a = rec.autos_25
0    8    | b = rec.sport_25
Representing documents by fuzzy concepts:

The output is as follows:
Correctly Classified Instances:      13    81.25%
Incorrectly Classified Instances:     3    18.75%
----------------------
Precision   Recall   F1      Class
1           0.625    0.769   rec.autos_25
0.727       1        0.842   rec.sport_25
----------------------
a    b    <---classified as
5    3    | a = rec.autos_25
0    8    | b = rec.sport_25

Classification with the 100-document data set (training set: 64, test set: 32).
As with the 50-document data set, we found that k = 2 gives the best classification result on this data set.
Representing documents by the keyword set:

The output is as follows:
Correctly Classified Instances:      23    71.875%
Incorrectly Classified Instances:     9    28.125%
----------------------
Precision   Recall   F1      Class
1           0.438    0.609   rec.autos_50
0.64        1        0.780   rec.sport_50
----------------------
a    b    <---classified as
7    9    | a = rec.autos_50
0    16   | b = rec.sport_50
Representing documents by fuzzy concepts:

The output is as follows:
Correctly Classified Instances:      27    84.375%
Incorrectly Classified Instances:     5    15.625%
----------------------
Precision   Recall   F1      Class
1           0.688    0.815   rec.autos_50
0.762       1        0.865   rec.sport_50
----------------------
a    b    <---classified as
11   5    | a = rec.autos_50
0    16   | b = rec.sport_50
Classification with the 500-document data set (training set: 334, test set: 166). Unlike the two cases above, here k = 8 gives the best classification result.
Representing documents by the keyword set:

The results are as follows:
Correctly Classified Instances:      131    78.916%
Incorrectly Classified Instances:     35    21.084%
----------------------
Precision   Recall   F1      Class
0.929       0.627    0.748   rec.autos_250
0.718       0.952    0.818   rec.sport_250
----------------------
a    b    <---classified as
52   31   | a = rec.autos_250
4    79   | b = rec.sport_250

Representing documents by the set of fuzzy concepts:

The results are as follows:
Correctly Classified Instances:      146    87.952%
Incorrectly Classified Instances:     20    12.048%
----------------------
Precision   Recall   F1      Class
0.957       0.795    0.868   rec.autos_250
0.825       0.964    0.889   rec.sport_250
----------------------
a    b    <---classified as
66   17   | a = rec.autos_250
3    80   | b = rec.sport_250

From the three cases above we observe that when the number of documents in the training and test sets is small, the classification results of the kNN algorithm are not high. As the number of documents increases, the results improve. The best value of the parameter k also grows with this number.
Chart comparing the fuzzy-concept representation with the usual keyword representation:
KT LUN V HNG PHT TRIN

Kt qu t c
Da vo cc nghin cu gn y trong bi ton x l vn bn, kha lun
nghin cu, chn lc cng nh pht trin mt s vn v t c nhng kt qu ban
u nh sau:
Tm hiu v trnh by c mt s phng php biu din vn bn.
Nghin cu v l thuyt tp m v cc php ton lin quan. Qua gii thiu
c mt phng php biu din vn bn da trn cc khi nim m.
Nghin cu v tm hiu v bi ton phn lp, trnh by mt s thut ton
phn lp tiu biu.
C c nhng kt qu th nghim, nhng so ban u khi p dng cch biu
din vn bn mi vi cch biu din thng thng. Qua thy c mt s
u im:
Gim bt c s chiu ca vector vn bn khi biu din. Gim bt
s phc tp khi tnh ton.
Cho kt qu tt hn khi p dng vo bi ton phn lp vi thut
ton kNN

Future work
- We propose a method for finding related keywords in documents. Start from the set of keywords of a document collection that has been through preprocessing. Within this keyword set, find in turn the groups of words that occur together in the documents and count how many times each group occurs. Set a threshold: if the count exceeds it, the words in the group can be considered related to one another. There are many ways to choose one keyword of the group as its concept, for example taking the word with the highest weight.
- In the near future, we will run classification with other algorithms (Naive Bayes, decision trees, SVM) to compare the classification results and find the algorithm that performs best with the fuzzy-concept representation.
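
The keyword-grouping proposal in the first item above can be sketched as follows. This is only an illustration under our own assumptions, since the proposal does not fix these details: co-occurrence is counted for pairs of keywords, once per document; pairs above the threshold are merged into groups as connected components; and the weights are supplied as a plain dictionary. The names related_pairs, group_keywords, choose_concept, threshold, and weights are hypothetical, and the toy data is invented.

```python
from collections import Counter
from itertools import combinations


def related_pairs(documents, threshold):
    """Count the documents in which each keyword pair co-occurs and
    keep the pairs whose count exceeds the threshold."""
    counts = Counter()
    for keywords in documents:
        # sorted(set(...)) counts each pair at most once per document
        for pair in combinations(sorted(set(keywords)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n > threshold}


def group_keywords(pairs):
    """Merge related pairs into groups (connected components, union-find)."""
    parent = {}

    def find(word):
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    for first, second in pairs:
        parent[find(first)] = find(second)

    groups = {}
    for word in list(parent):
        groups.setdefault(find(word), set()).add(word)
    return list(groups.values())


def choose_concept(group, weights):
    """Pick the keyword with the highest weight as the group's concept."""
    return max(sorted(group), key=lambda word: weights.get(word, 0.0))


# Toy data, invented for illustration only.
documents = [
    ["car", "engine", "wheel"],
    ["car", "engine", "road"],
    ["engine", "car"],
    ["game", "team", "score"],
    ["game", "team", "player"],
    ["car", "game"],
]
weights = {"car": 0.9, "engine": 0.6, "game": 0.8, "team": 0.5}

pairs = related_pairs(documents, threshold=1)
for group in group_keywords(pairs):
    print(sorted(group), "->", choose_concept(group, weights))
```

With a threshold of 1, only the pairs (car, engine) and (game, team) co-occur often enough in the toy data, so two groups are formed and represented by "car" and "game". Counting larger groups directly, or choosing the concept by another criterion, would fit the proposal equally well.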