
Association Rules Mining for Identifying Popular Ingredients on YouTube Cooking Recipes Videos


Boby Siswanto, Putri Thariqa
Informatics Study Program
School of Creative Technology Bina Nusantara Bandung
Bandung, Indonesia
boby.siswanto@binus.ac.id, putri.thariqa@binus.ac.id

Abstract—YouTube hosts a large number of videos from which datasets can be built. Each YouTube video has characteristics such as its number of views, likes, dislikes, and comments. Association rules mining is able to find the most dominant items in a dataset. This research investigates 40 random YouTube videos by applying an association rules mining algorithm to find the most used ingredients in Indonesian cooking recipes. It finds that the most liked videos use two main ingredients, garlic and onion. This research also implements the IST-EFP algorithm to reduce the dimension of the dataset without losing the important rules obtained. IST-EFP reduced the dataset dimension by 19% with a 0.7% difference in the rules obtained.

Keywords—association rule mining, YouTube dataset, IST-EFP algorithm, cooking recipes ingredients

I. INTRODUCTION
YouTube is one of the largest video hosting services, and datasets can be produced from it [1]. One YouTube video genre is cooking recipes, and Indonesian cooking recipes are one kind within that genre. Every YouTube video has characteristic values such as views, likes, dislikes, and comments. A video with many likes indicates that users like it. Cooking recipes consist of several ingredients, and the composition of those ingredients can be recognized [2].

Association Rules Mining (ARM) is a data mining technique for identifying relations between several items in a dataset. Two values are considered in association rules mining: the support value and the confidence value. Association rules mining results can be used as recommendations for managerial decisions [3]. One technique for improving the processing time of Association Rules Mining is to reduce the dataset dimension. This technique has already been applied to a clinical dataset [4]. ARM should be able to recognize the item composition of cooking recipes.

This study has two purposes. The first is to find the most used ingredient pattern in Indonesian cooking recipes liked by YouTube visitors by implementing Association Rules Mining. The second is to test the validity of the IST-EFP algorithm on the cooking recipes dataset.

This paper is organized into sections as follows. Section II describes the methodology, including the processes. Section III describes the theory of Association Rules Mining. Section IV describes the data preparation, including the data structures used. Section V presents the IST-EFP algorithm. Section VI discusses the experiments, including the numbers obtained. Finally, Section VII draws the conclusion.

II. METHOD
The research method applied to achieve the goal is divided into three processes. The first is to create a dataset by pre-processing YouTube videos. The pre-processing is done by implementing an ETL mechanism using Oracle SQL Developer tools [5][6]. The second process is applying an association rules mining algorithm directly to the YouTube cooking recipes dataset to obtain ingredient patterns. The third is reducing the original dataset with the IST-EFP algorithm to obtain a reduced dataset, and then processing it with association rules mining to obtain another set of ingredient patterns [3]. The ingredient patterns in both sets of rules are then compared to find their level of similarity.

[Flow chart: Start → YouTube Cooking Recipes Data → Pre-Process → YouTube Cooking Recipes Dataset → (a) Association Rules Generation → Ingredients Patterns; (b) IST-EFP Algorithm → Reduced Cooking Recipes Dataset → Association Rules Generation → Ingredients Patterns; both branches → Comparison and Conclusion → Stop]

Fig. 1: The Flow Chart of Research Method
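
The second process described above can be illustrated with a minimal Python sketch. It assumes each recipe is a set of ingredient names and uses the standard definitions of support (the share of transactions containing an itemset) and confidence (the share of transactions containing X that also contain Y). The toy recipes and the helper names `support_counts` and `confidence` are hypothetical and are not the PL/SQL implementation used in this research.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each recipe (transaction) is a set of ingredients.
recipes = [
    {"onion", "garlic", "salt"},
    {"onion", "garlic", "bay leaf"},
    {"garlic", "salt"},
    {"onion", "garlic"},
]


def support_counts(transactions, length):
    """Count how many transactions contain each itemset of the given length."""
    counts = Counter()
    for items in transactions:
        # Sorting gives every itemset one canonical tuple form.
        for itemset in combinations(sorted(items), length):
            counts[itemset] += 1
    return counts


def confidence(transactions, x, y):
    """Conf(X -> Y): share of transactions containing x that also contain y."""
    with_x = [t for t in transactions if x in t]
    return sum(1 for t in with_x if y in t) / len(with_x)


total = len(recipes)
pair_counts = support_counts(recipes, 2)
pair_support = {pair: n / total for pair, n in pair_counts.items()}

print(pair_support[("garlic", "onion")])
print(confidence(recipes, "onion", "garlic"))
print(confidence(recipes, "garlic", "onion"))
```

On this toy data the pair (garlic, onion) appears in 3 of 4 recipes, so its support is 0.75, while Conf(onion → garlic) is 1.0 and Conf(garlic → onion) is 0.75. The two directions differ because garlic appears in more recipes than onion.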
978-1-5386-9422-0/18/$31.00 ©2018 IEEE
The 1st 2018 INAPR International Conference, 7 Sept 2018, Jakarta, Indonesia 95
III. ASSOCIATION RULES MINING
Association Rules Mining (ARM) is a data mining technique used to find patterns in a dataset. An ARM dataset usually takes the form of a time series dataset obtained from periodically recurring transactions. The patterns obtained are called frequent patterns and consist of item combinations with strong or weak relations. Usually the strong relations among frequent patterns are used for making decisions.

Two values are highly considered in Association Rules Mining: the support value and the confidence value. The support value indicates how often an item appears across all transactions in the dataset, and the confidence value indicates how strong the relationship is between one item and another within a rule or pattern [3][4].

$$\mathrm{Sup}(X) = \frac{\sum X}{T} \tag{1}$$

$$\mathrm{Sup}(X, Y) = \frac{\sum (X, Y)}{T} \tag{2}$$

$$\mathrm{Conf}(X \rightarrow Y) = \frac{\mathrm{Sup}(X, Y)}{\mathrm{Sup}(X)} \tag{3}$$

$$\mathrm{Conf}(X \rightarrow Y) = \frac{\sum (X \cup Y)}{\sum X} \tag{4}$$

Here $\sum X$ is the number of transactions containing X, $\sum (X, Y)$ (written $\sum (X \cup Y)$ in equation (4)) is the number of transactions containing both X and Y, and T is the total number of transactions. The formulas for the support value are given in equations (1) and (2), and those for the confidence value in equations (3) and (4).

IV. DATA PREPARATIONS
The first step of this research is collecting the YouTube cooking recipes dataset. This is done by watching videos on YouTube and collecting the cooking recipes manually. The data obtained are then transferred into a database by implementing an ETL mechanism using Oracle SQL Developer. 40 YouTube videos are used in this research, all published between January 2018 and May 2018. The period is limited to obtain convenient data statistics, because over a longer period several users may publish customized duplicates of the same video.

TABLE I. RECIPES TABLE STRUCTURES

RECIPES
ATTRIBUTE         DATA_TYPE
Id (PK)           NUMBER
Title             VARCHAR2(128)
Link              VARCHAR2(64)
Date_Published    DATE
Views             NUMBER
Likes             NUMBER
Dislikes          NUMBER
Comments          NUMBER
Access_Date       DATE

TABLE II. INGREDIENTS TABLE STRUCTURES

INGREDIENTS
ATTRIBUTE         DATA_TYPE
Id (FK)           NUMBER
Name              VARCHAR2(64)

The data obtained are stored in a database with the structures shown in Table I (Recipes table) and Table II (Ingredients table) [13]. The tables have 1-N cardinality, as shown in Figure 2.

[ERD: RECIPES (PK Id, Title, Link, Date_Published, Views, Likes, Dislikes, Comments, Access_Date) 1-N INGREDIENTS (Id, Name)]

Fig. 2. Cooking Recipes ERD

V. IST-EFP ALGORITHM
The IST-EFP algorithm is able to reduce the dimension of a time series dataset by about 2.33% [3]. It applies intersection from set theory to the EFP (Expand FP-Growth) algorithm. The EFP algorithm itself is an FP-Growth algorithm integrated with database tables [7]. In this research, the IST-EFP algorithm is implemented with PL/SQL scripts [3, 4, 8, 10, 14, 15].

IST-EFP(Dataset, minSupCount)
1. X = Dataset
2. X1 = CREATE temporary table FROM X WHERE COUNT(*) > minSupCount
3. Y1 = CREATE EFP table FROM X1
4. Z = Y1 ∩ X on Y1.previtem IS NOT NULL
5. Return Z

Fig. 3. IST-EFP Algorithm
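
Step 2 of Fig. 3 can be sketched as follows. This is only an illustration of the frequency filter, written in Python with SQLite standing in for the Oracle database used in this research. The sample rows and the threshold are hypothetical, and the EFP table construction (step 3) and the intersection (step 4) are not reproduced here.

```python
import sqlite3

# Hypothetical sample rows in the shape of the INGREDIENTS table (Id, Name);
# Id refers to the recipe that the ingredient belongs to.
rows = [
    (1, "onion"), (1, "garlic"), (1, "salt"),
    (2, "onion"), (2, "garlic"), (2, "nutmeg"),
    (3, "garlic"), (3, "salt"), (3, "onion"),
]
min_sup_count = 1  # assumed threshold, chosen only for this toy data

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ingredients (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO ingredients VALUES (?, ?)", rows)

# Step 2 of Fig. 3: keep only rows whose item occurs more than
# min_sup_count times. In standard SQL the COUNT(*) condition is
# expressed with GROUP BY ... HAVING rather than WHERE.
frequent_rows = conn.execute(
    """
    SELECT id, name FROM ingredients
    WHERE name IN (
        SELECT name FROM ingredients
        GROUP BY name
        HAVING COUNT(*) > ?
    )
    ORDER BY id, name
    """,
    (min_sup_count,),
).fetchall()
conn.close()

# Share of rows removed by the filter, in percent.
reduction_pct = 100 * (len(rows) - len(frequent_rows)) / len(rows)
print(len(rows), len(frequent_rows), round(reduction_pct, 2))
```

In this toy table only nutmeg occurs once, so one of nine rows is dropped. The filter is strict (`>`), following the pseudocode, so an item occurring exactly `min_sup_count` times is also removed.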

The IST-EFP algorithm is shown in Figure 3. Line 3 shows the formation of the EFP table, while line 4 is the main process of the IST-EFP algorithm.

VI. EXPERIMENTS & DISCUSSIONS
Based on the research method stated, this research carries out three main processes, each following software engineering flows [12] and database design theories [13]. Each process is tested using black box testing to make sure the output is valid compared with manual calculations [9][11].

In the first process, data gathering is completed and the data are stored in the database using the structures in Table I and Table II; 71 ingredients are found. In the second process, the YouTube cooking recipes are processed with the association rules algorithm, and the support values obtained are shown in Table III. Table III shows that onion and garlic have the strongest relation of all ingredients. Table IV shows that garlic and onion have the strongest confidence value. This means that almost all Indonesian cooking recipes use onion and garlic together.

TABLE III. SUPPORT VALUES

ITEMSETS                   SUPPORT COUNT   LENGTH   TOTAL TRX   SUPPORT PCT
onion, garlic              30              2        41          73.17
onion, bay leaf            22              2        41          53.66
onion, salt                23              2        41          56.10
garlic, bay leaf           22              2        41          53.66
garlic, salt               29              2        41          70.73
onion, garlic, bay leaf    21              3        41          51.22
onion, garlic, salt        22              3        41          53.66

TABLE IV. CONFIDENCE VALUES

X         Y           CONF(X→Y)   CONF(Y→X)
onion     garlic      97%         79%
onion     bay leaf    71%         92%
onion     salt        74%         74%
garlic    bay leaf    58%         92%
garlic    salt        76%         94%

In the third process, the original data are processed by the IST-EFP algorithm to reduce their dimension. The original table has 605 rows of items and the IST-EFP table has 493 rows, which means 19% of the rows are removed. The number of rules obtained without the IST-EFP algorithm is 8750 and with the IST-EFP algorithm is 8813, a difference of 0.7%. Although the dimension is reduced by IST-EFP, the association rules (support and confidence values) do not differ from those obtained from the original dataset, as shown in Table V and Table VI.

TABLE V. IST-EFP'S SUPPORT VALUES

ITEMSETS                   SUPPORT COUNT   LENGTH   TOTAL TRX   SUPPORT PCT
onion, garlic              30              2        41          73.17
onion, bay leaf            22              2        41          53.66
onion, salt                23              2        41          56.10
garlic, bay leaf           22              2        41          53.66
garlic, salt               29              2        41          70.73
onion, garlic, bay leaf    21              3        41          51.22
onion, garlic, salt        22              3        41          53.66

TABLE VI. IST-EFP'S CONFIDENCE VALUES

X         Y           CONF(X→Y)   CONF(Y→X)
onion     garlic      97%         79%
onion     bay leaf    71%         92%
onion     salt        74%         74%
garlic    bay leaf    58%         92%
garlic    salt        76%         94%

By applying an SQL query, garlic and onion are found in the most liked cooking recipes, as shown in Table VII.

TABLE VII. MOST LIKED COOKING RECIPES

X         Y           CONF(X→Y)   CONF(Y→X)
onion     garlic      97%         79%
onion     bay leaf    71%         92%
onion     salt        74%         74%
garlic    bay leaf    58%         92%
garlic    salt        76%         94%

VII. CONCLUSION
Based on the results of the experiments, Association Rules Mining is able to determine the most used ingredients in Indonesian cooking recipes. The most used ingredients are garlic and onion. This is a trivial result, but they are the most used among 71 ingredients.

The IST-EFP algorithm can be implemented on the cooking recipes dataset. The dataset dimension is reduced by 19%, with a 0.7% difference in rules, without missing the important (strong) rules.

VIII. REFERENCES
[1] C. Kosla, "YouTube Data Analysis Using Hadoop," California State
University, California, 2016.
[2] M. Bolanos, A. Ferra and P. Radeva, "Food Ingredients Recognition
Through Multi-label Learning," in International Conference on
Image Analysis and Processing, Catania, 2017.
[3] B. Siswanto, T. H. Liong and Shaufiah, "Dimensionality Reduction
for Association Rule Mining with IST-EFP Algorithm," in
International Conference on Information and Communication
Technology (ICoICT), Bali, 2015.
[4] Shaufiah and B. Siswanto, "Association rule mining for identifying
Dengue Hemorrhagic Fever (DHF) and Typhoid Fever (TF) disease
with IST-EFP algorithm," in International Conference on
Information and Communication Technology (ICoICT), Bandung,
2016.
[5] P. Drzymala, S. Wiak and H. Welfle, "Using ORACLE tools to
generate Multidimensional Model in Warehouse," Przegląd
Elektrotechniczny, vol. 88, pp. 257-262, 2012.
[6] R. Shrivastava, G. K. Saxena and K. Patidar, "Implementation Of
Unified Query For Big Database Using Sql On Oracle Platform,"
International Journal of Current Innovation Research, vol. 3, no.
03, pp. 616-621, 2017.
[7] S. Xuequn, S. Kai-Uwe and G. Ingolf, "SQL Based Frequent Pattern
Mining with FP-growth," in 15th International Conference on
Applications of Declarative Programming and Knowledge
Management, and 18th International Conference on Workshop on
Logic Programming, Potsdam, 2004.
[8] S. Feuerstein and B. Pribyl, Oracle PL/SQL Programming, Fifth
Edition, California: O’Reilly Media, Inc., 2009.
[9] S. R. Jan, S. T. U. Shah, Z. U. Johar, Y. Shah and F. Khan, "An
Innovative Approach to Investigate Various Software Testing
Techniques and Strategies," International Journal of Scientific
Research in Science, Engineering and Technology, pp. 682-689,
2016.
[10] V. K. Myalapalli and B. L. R. Teja, "High Performance PL/SQL
Programming," in International Conference on Pervasive
Computing (ICPC), 2015.
[11] Prabhjot and N. Sharma, "Overview of the Database Management
System," International Journal of Advanced Research in Computer
Science, pp. 262-269, 2017.
[12] R. S. Pressman, Software Engineering A Practitioner's Approach 5th
Edition, New York: McGraw-Hill, 2001.
[13] R. Ramakrishnan and J. Gehrke, Database Management Systems
Second Edition, New York: McGraw-Hill College, 2000.
[14] D. Saisanguansat and P. Jeatrakul, "Optimization Techniques for
PL/SQL," in Fourteenth International Conference on ICT and
Knowledge Engineering, 2016.
[15] M. Q. Memon, H. Jingsha, Aasma, A. Ditta and K. G. Rana,
"Dynamic Integration of PL/SQL for Complex Queries,"
International Journal of Database Theory and Application, vol. 9,
no. 10, pp. 351-362, 2016.
