Contents
1 Proposed Supervisor
2 Research Area
3 Research Background
4 Research Aims
5 Research Question
7 Ethical Issues
8 Sources of literature
9 Your own expertise and how well you are positioned to carry out the work
2 Research Area
The domain of my research interest can be broadly classified as the exploration
of the ensemble method XGBoost; the subject of this study is Data Analytics.
3 Research Background
eXtreme Gradient Boosting (XGBoost) is an improvement over Gradient Boost-
ing. It is a machine learning method widely used by data scientists to achieve
state-of-the-art results, and tree boosting gives excellent results on many standard
classification benchmarks. XGBoost provides parallel tree boosting (also known
as GBM) that solves many data science problems quickly and accurately.
The most important factor behind the success of XGBoost is its scalability,
which rests on several vital systems and algorithmic optimisations: a novel tree
algorithm for handling sparse (missing) data, and parallel and distributed
computing that makes the learning process faster and enables quicker model
exploration [2].
XGBoost belongs to a broader collection of tools under the umbrella of the
Distributed Machine Learning Community, who are also the creators of the
famous mxnet deep learning library [1]. Tianqi Chen and Carlos Guestrin
addressed directions such as out-of-core computation, cache-aware learning and
sparsity-aware learning, which had not been explored in existing work on
parallel tree boosting.
4 Research Aims
The aims for my research are sequentially defined as follows:
1. To understand the idea behind XGBoost Understanding the scalability
term coined for XGBoost by Tianqi Chen, the advantages and shortcomings
of XGBoost, and the mathematics behind this ensemble.
2. To implement XGBoost Implement XGBoost in different frameworks
such as R, Python and Spark, and on different datasets, such as imbalanced
datasets and datasets with many missing values.
3. To explain results Since ensembles are mostly treated as black boxes,
interpreting the results is especially important; for XGBoost, the R package
'xgboostExplainer' breaks a prediction down into the contributions of
individual features.
4. To compare Gradient Boosting vs XGBoost vs other ensembles
The major focus will be comparing gradient boosting with XGBoost,
understanding the shortcomings of gradient boosting and the benefits of
using XGBoost. Finally, a performance comparison will be carried out
among different popular ensembles.
5 Research Question
Considering the background and the current scenario, we have four concrete
questions to answer:
Q1. What is the idea behind the development of XGBoost, and what is the
mathematics underlying it?
Q2. How can we implement an XGBoost model on different datasets using
different frameworks?
Q3. What results does XGBoost generate, and how can we go about explaining
why XGBoost made a particular decision?
Q4. How do Gradient Boosting and XGBoost compare and contrast, and how
does XGBoost compare with other ensembles?
7 Ethical Issues
During this research, the researcher must act ethically, and no malpractice may
occur. All data collected from Trinity College Dublin will be kept confidential,
used only for the required research purposes, and protected against any leakage.
8 Sources of literature
The sources of literature can be broadly classified as:
1. XGBoost: A Scalable Tree Boosting System [2]
2. Improvement over gradient boosting
References
[1] A gentle introduction to XGBoost for applied machine learning.
https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/.
Accessed: 2017-12-31.
[2] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting
system. In Proceedings of the 22nd ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, KDD '16, pages 785–794, New
York, NY, USA, 2016. ACM.