Senior Project Proposal

CollegeProRec: A Recommendation System
Eish Kapoor

Statement of Purpose
The project goal is to recommend a list of college programs (e.g. Harvard Computer Science,
Harvard Economics) for a user based on:

1) Matching the user profile

2) Outcomes of past predictions

3) Dynamic information scraped from public and private data sources

The core technology is a recommendation engine that uses machine learning to learn from prior
outcomes and NLP (natural language processing) to leverage online reviews. The motivation for
this project stems from my own college process, where my peers and I found it difficult to
wade past university prestige, parents’ and students’ biases, and the internet. The focus of this
project is to return a fit: a school that suits the student academically and culturally. The program
will be geared much more toward academic fit, but will try to include some qualitative inputs to
gauge cultural fit. Much of the program’s definition of fit is open to change based on the data.
Ideally, my program will have multiple data inputs that aptly describe an individual and their fit:
grades, extracurriculars, race, gender, test scores, etc. The inputs will change depending on data
availability as well as further research. For the data methodology, a profile must include most, if
not all, of the data inputs the engine requires to return a fit.
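
As a minimal sketch of that completeness rule, the Python below models a profile with optional
fields and accepts it only when enough of them are filled in. The field names, the Profile class,
and the 0.8 threshold are assumptions made for illustration; the real inputs will depend on data
availability.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class Profile:
    # Hypothetical inputs; the final list depends on what data is available.
    gpa: Optional[float] = None
    test_score: Optional[int] = None
    extracurriculars: Optional[List[str]] = None
    race: Optional[str] = None
    gender: Optional[str] = None


def completeness(profile: Profile) -> float:
    """Return the fraction of profile fields that have been filled in."""
    all_fields = fields(profile)
    filled = sum(getattr(profile, f.name) is not None for f in all_fields)
    return filled / len(all_fields)


def is_usable(profile: Profile, threshold: float = 0.8) -> bool:
    """Accept a profile only if it covers most of the inputs (assumed threshold)."""
    return completeness(profile) >= threshold


mostly_complete = Profile(gpa=3.8, test_score=2100,
                          extracurriculars=["robotics"], gender="F")
nearly_empty = Profile(gpa=3.8)

print(completeness(mostly_complete), is_usable(mostly_complete))
print(completeness(nearly_empty), is_usable(nearly_empty))
```

Under this assumed threshold, a profile with four of five inputs passes, while a nearly empty one is
rejected before the engine tries to return a fit.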

Since a recommendation engine with a database of all colleges in the United States is an
undertaking that cannot be done in twelve weeks, I will focus on one “conference” of schools,
such as the Pac-12, SEC, ACC, or Ivy League. The programs I cover will be the top enrolled
programs at every college in that conference, along with the top accredited dual-major programs
(e.g. MET at UCB, M&T at Penn). Since data is a gateway issue, I will select the conference with
the most readily available and accessible data. This data will be gathered from numerous sources:
data released by colleges, Naviance, CollegeBoard, AdmitSee, etc.

Background
History of Recommendation
Since my project is at its core a recommendation engine, it’s important to pay heed to the
beginnings of recommendation. Recommending has always been a part of society, with the earliest
traces being in “The Epic of Gilgamesh,” where Gilgamesh’s advisors tell him he’s crazy for
attempting to become immortal. Royal advisors are the first trace of recommendation, and they are
famous around the world: Rome (Maecenas), India (Birbal), and Russia (Rasputin). They were tasked
with informing rulers about what steps to take, for the rulers had neither the time nor the patience
to collect and comprehend all the information and situations in their land. Other instances in
history are sacred texts and religious figures. Texts like the Bible recommend a specific way of
life to their followers so that they can succeed in reaching God. The same goes for figures like
Luther and Muhammad, who preached ideals and practices that would have God favor their followers
in life as well as the afterlife.

Now, recommendation has become part of many professions. Therapists, life coaches, and
physical trainers are all examples, and these are fairly well-paying jobs. For example, a marriage
and family therapist makes approximately fifty thousand dollars a year. In my opinion, this
increased need for recommendations coincides with the growth of societal awareness. As a society,
we are more aware of our habits and undertakings than ever (veganism, recycling, and global
warming, for example), so we consistently look for recommendations.

Current Recommendation Ed-Tech


On the market, there are some resources available to the general public that help find colleges
based on the data entered. The most used resources are “BigFuture” by College Board and
COLLEGEdata by NACAC. Here’s an analysis of these two engines:

BigFuture
BigFuture has two options: search by college name or use their college search. The college search
has ten main categories: test scores & selectivity, type of school, location, campus & housing,
majors & learning environment, sports & activities, academic credit, paying, support services, and
diversity. They cross-reference your preferences and the data you provide with their logged
information on 3,702 colleges, and then recommend a list of colleges from it.

COLLEGEdata
COLLEGEdata offers the same two options, but its college search differs in some key aspects. For
entrance difficulty, rather than asking for SAT scores, it asks the user what level of difficulty
they prefer. There’s also a measure of freshman satisfaction, which is another term for retention
rate. Last, it has a section for graduation rate. These differences seem inefficient. It doesn’t
make sense for the student to rate difficulty, because this again leaves the process up to
subjectivity, while freshman satisfaction and graduation rate seem peripheral. It’s also important
to note that COLLEGEdata’s user interface is antiquated compared to BigFuture’s.

Here are some overall notes on these two programs:

• Both help narrow down the selection of colleges with fairly legitimate bounds. Most students
seem to have a feel for what they believe is their fit in terms of location, support services,
and paying. That’s a huge part of the fit aspect of the process, and these engines seem to do a
great job of capturing it.
• The test scores & selectivity category is a good indicator of which schools are attainable,
but scores range heavily depending on the major. BigFuture does not disclose its methodology,
but from entering data as a test user, it appears that they have not separated test scores and
selectivity by major. In data released by College Board in 2013, the mean SAT score for
Engineering was 1620 while for Public Administration it was 1372. Whether you fit a major’s
average score range depends hugely on the major you indicate interest in.
• Speaking of majors, intended majors are not a great indicator of what the student will major
in at college. According to the National Center for Education Statistics, eighty percent of
students change their major at least once. That’s important to note because most of these
current recommendation engines put an emphasis on intended major.
• Both of these engines recommend colleges, not specific programs. This is important to
note because some schools have dual-major programs that would be excellent fits for
interdisciplinary-focused students. This need isn’t being addressed.
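
To illustrate why a single score cutoff can mislead, here is a small Python sketch that compares
one user’s score against the mean for each major rather than against one school-wide figure. The
two means are the 2013 College Board figures quoted above; the function name and the sample user
score are hypothetical.

```python
# Mean SAT scores by intended major, as quoted above from College Board's 2013 data.
MEAN_SAT_BY_MAJOR = {
    "Engineering": 1620,
    "Public Administration": 1372,
}


def score_gap(user_score: int, major: str) -> int:
    """Return how far a score sits above (+) or below (-) the mean for a major."""
    return user_score - MEAN_SAT_BY_MAJOR[major]


# Hypothetical user score, chosen only to show the contrast between majors.
user_score = 1500
gaps = {major: score_gap(user_score, major) for major in MEAN_SAT_BY_MAJOR}

for major, gap in gaps.items():
    position = "above" if gap >= 0 else "below"
    print(f"{major}: {abs(gap)} points {position} the mean")
```

The same score lands below the mean for one major and above it for the other, which is the kind of
distinction a major-blind selectivity filter would hide.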

There are more college recommendation engines out there that I do not have access to, but two are
worth noting: Parchment and AcceptEdge. Parchment raised millions of dollars in venture capital
to create an engine that takes test scores, schools that you “have to apply to”, and extracurriculars
to give probabilities of admission. They’ve decided to take the approach that many investors take
with their stock portfolios: hedging your bets. This approach intrigues me, because it doesn’t try
to set your college list. It allows the user to set the goal and then calculates the probability of
admission. They use their in-house data, and since Parchment doesn’t have a large user base, the
sample size is fairly limited. As for AcceptEdge, their approach is drastically more personal. They
look at the run-of-the-mill information, but also at the books the user enjoys, hobbies, and
favorite music. The engine then uses an extensive database to give a percent chance of admission,
ideal schools, and information on the most predictive indicators of admission.

Tools to Create CollegeProRec


There will be a multitude of tools used to complete this project, ranging from the Weka
Workbench to CSS. First is the backend, or the machinery that will make the program work. It
starts with the Weka Workbench, an introductory machine learning program that will be used to
create and mold the algorithm for the program. Depending on the complexity of the algorithm, this
might change to a more advanced toolset such as the Python ecosystem, which would put the project
on a different development path. Assuming I go with the Weka Workbench, it’s streamlined to work
with Java, which will help build the application. Since Java has an adapter for websites, I can
then use that to transfer the algorithm to a site, which I will customize using CSS and other
tools to create a welcoming user interface.

Prior Research
• The University of Waterloo released a paper on the application of machine learning to
recommender systems. In that paper, they review the holistic application of Bayesian-centered
machine learning systems that follow decision trees. They point out that decision-tree-based
ML systems are the most used approach, due to their relative simplicity. This research is
helpful in deciding that the Bayesian approach might be most effective, because my algorithm
should not be overly complicated. It would also be the conservative approach to assimilating
foundational data. They briefly note its possible potency in dealing with education.
• A paper written by David McAllester, “Case Studies in Machine Learning”, gives a cohesive
introduction to past work in machine learning. To summarize, he claims that the Bayesian
approach (the one described by Waterloo) is flawed due to its need for human input, but is
the most consistent approach known today. He does not delve into the educational aspect.

This is the research of note, and it essentially boils down to a Bayesian approach being the best
option. I’ll have to do more in-depth research on the application of the Bayesian approach, which I
will do in phase one of my schedule.
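
One simple instance of a Bayesian approach is a naive Bayes classifier. The sketch below is plain
Python rather than the Weka workflow itself, and it is only meant to show the shape of the idea:
count how often each input value appears with each outcome, then score a new profile by combining
those counts with Bayes’ rule. The feature names, the toy training rows, and the program labels
are invented for illustration and are not real admissions data.

```python
import math
from collections import Counter, defaultdict


def train(rows, labels):
    """Count label frequencies and, per label, how often each feature value occurs."""
    label_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (label, feature) -> Counter of values
    feature_values = defaultdict(set)     # feature -> distinct values seen in training
    for row, label in zip(rows, labels):
        for feature, value in row.items():
            value_counts[(label, feature)][value] += 1
            feature_values[feature].add(value)
    return label_counts, value_counts, feature_values


def predict(model, row):
    """Return the label with the highest log-posterior, using add-one smoothing."""
    label_counts, value_counts, feature_values = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior for this label
        for feature, value in row.items():
            seen = value_counts[(label, feature)][value]
            n_values = len(feature_values[feature])
            # Smoothed likelihood so unseen values never give log(0).
            score += math.log((seen + 1) / (count + n_values))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up training data: each row is a past profile, each label a program that fit.
rows = [
    {"test_band": "high", "activity": "robotics"},
    {"test_band": "high", "activity": "robotics"},
    {"test_band": "mid", "activity": "debate"},
    {"test_band": "mid", "activity": "debate"},
]
labels = ["Computer Science", "Computer Science", "Economics", "Economics"]

model = train(rows, labels)
print(predict(model, {"test_band": "high", "activity": "robotics"}))
```

The appeal for this project is that the model stays small and inspectable; the cost is the “naive”
assumption that inputs are independent of one another, which real profile data may not satisfy.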

Significance
This project is significant because of its possible impact on college admissions. Twenty million
students are expected to enroll in college this fall, and these are potential users of my program.
Logically, a student can only attend one college, so it’s important for the student to focus
efficiently on specific colleges that fit their needs. College counselors recommend that a
student apply to five to eight colleges, yet there are over three thousand colleges in the United
States. With CollegeProRec, I aim to create the best tool on the market for narrowing down that
list.

Earlier, I listed the intricacies of the top similar programs on the market. I hope to improve on
the shortcomings of these engines, because I believe that’s where the real value of the
recommendation will arise. I hope to automate and quantify fit through some important factors:

• Weighing the user’s major choice, but not heavily: A huge aspect of selecting colleges to
apply to is the fit, and that fit comes down to preferred majors. But as I stated earlier, many
students change their major during their college experience. I hope to find the right
emphasis on the user’s major through my program. Here are some options for doing so:
o Quantitative: In a perfect world, there would be data showing which majors
students switched from and switched into. This would have to be a large sample
size, as machine learning requires, but it would set the precedent for the model’s
algorithm.
o Qualitative: This is the less effective approach, but might be necessary. By
interviewing current college students that I know, I could try to program the
algorithm to follow their insights. There’s a large flaw with this:
▪ My connections are very biased. Most of them are extremely sure of their
career path, and the career paths are predominantly biology related or
engineering related. I wouldn’t be able to gather much data about the
liberal arts side of colleges.
• Quantifying and correlating hobbies/extracurriculars with majors: This is probably the most
difficult automation aspect of this project. Hobbies and extracurriculars are a very
important indication of an individual’s interests and inclinations, but this indication is not
easy to quantify. I’d put it like this: hobbies and extracurriculars give a feel for the person,
but feelings are inherently qualitative. The same options as for majors apply here, but I’ll
also be looking into some psychology journals to get a better understanding of the
correlation between hobbies and vocations/majors.
• Programs + Colleges: This is an important wrinkle that I’ve found important in my own college
process. Being recommended a specific program is vastly more beneficial for essays as well
as fit, because every program has a different curriculum. Also, these programs usually have
specific essays, so it helps the student get an idea of what essays they would have to write
as well.

It’s also important to note that my perspective alone sets me apart from the other engines in the
field. I’m currently an applicant, so I know the test cases as well as the customer segments,
because I am the customer.

On a personal level, I aim to gain insights into data collection and machine learning algorithm
creation from this project. The data this project will deal with is plentiful, so it offers an ideal
playing ground to build the foundation necessary for my intended future ventures. I plan to create
an educational technology startup in the future, one that recognizes trends in grades to recommend
solutions and paths to success for users’ problems. A learning management system of this
magnitude would require large amounts of data processing, so this project would be beneficial in
building the knowledge to create that learning management system.

Schedule of the On-Site Experience
I will have my project split up into three phases: introduction to ML, creation of the database,
and finishing the program. In the introduction phase, I will work with my off-site mentor on the
foundational layout of the recommendation system, and start Andrew Ng’s Coursera course on
machine learning. Afterwards, I will do some background research on the psychological effects of
recommendations and their overall power. This will include the success rate of recommendations,
what occurs when recommendations are successful, and success rates in educational settings. This
will help build the right literature base and also lay the foundation for actually creating the
product. In the database creation phase, I will focus on gathering the correct data for the
program to parse through, while also developing the Bayesian foundation for the program. ML
programs require large amounts of data to really work, so this will take a great deal of time and
sculpting. I will need to learn Ruby while also teaching the program how to look for trends,
specifically the trends I’m looking for. Most of the program’s problems will arise here, as there
will be multiple flaws and comprehension issues. The last phase will consist of polishing the
project and adjusting the small intricacies. I will create more complex parts while churning out
real recommendations. This will also take a great deal of time.

For the first few weeks, my goal is to get the Coursera course down and establish an
understanding of Python/Ruby. After that, I hope to have a literature base that gives me a better
understanding of how to apply the Bayesian approach, while having 50% of the required data to
start creating my Bayesian model. In weeks 5-6, I plan to have an overall model and a basic
website layout for my idea. I want this overall model to have input and output. In weeks 7-8, I
want a working model that takes in numerous inputs, and I want to have all the data I need for
the project.

I’m interning at Contiq, the startup where I also interned over the summer. This extension should
also be similar in substance; over the summer I focused on Optical Character Recognition. Optical
Character Recognition is an application of machine learning that takes images of text and
converts them into editable text. I’ll continue working with the development team at Contiq through
my external advisor, as the same process of teaching the program translates to my concept. My
external advisor Arun Lal is also very well versed in data assimilation and parsing, so his insight
will be especially advantageous in the second phase. I intend to use his expertise in the form of
feedback, as I’ll be running UI (user interface) changes and data techniques past him to improve my
user experience and algorithm. Mr. Lal has also offered to let me communicate with other Contiq
employees, so I’ll also be working with the Contiq development and UI teams to improve my
product.

Problems
There are many possible obstacles for this project. The first problem is definite: data. Data
assimilation will probably be the most difficult aspect of this project, because there’s no gold
standard for data. The standards and requirements for data will most probably shift as I learn what
data is available to me. Where to get the data will be an issue as well. There are numerous
data banks on the internet (Naviance, AdmitSee, etc.), but there isn’t homogeneity in the data they
provide. I’ll most likely have to insert data manually in the beginning, but ideally I can find a
tool that scrapes the internet for data. So it’ll be up to me to decide what data is entered. That
leads to another issue: removing human bias. This will be a perpetual tug of war, as the program
requires input to begin building itself and a human to inform it in the beginning, but will
eventually need to transcend human biases in its recommendations. There will definitely also be
coding problems, among others, as well as difficulties syncing the machine learning programs with
Python, Java, and UI programs. In the end, the largest problem is figuring out whether the
program’s recommendations (assuming it works) are correct. That’s something that would only be
known after four years, or perhaps never known at all.

My solutions for these problems are not evident as of now, but I do have some ideas for tackling
them. For data assimilation, I’ll be asking my school’s college counselors, as well as fellow
students, what they have found to be the most important indicators of fit. That’ll help in
setting the prerequisites for what data I need. As for human bias, I’ll check it by working with
my external advisor to ensure the variables and inputs aren’t biased. It’ll take numerous checks
by different people to ensure this doesn’t occur. As for the result flaw, I believe the best
approach will be taking past cases (previous Basis students, faculty, etc.) and entering their
data. Then I will cross-reference their actual decisions with the program’s recommendations, and
use that to decide whether the recommendations are working. It’s also important to note that
college decisions are not always rational, but that’s part of the fit. Even though there might’ve
been a better fit for someone academically, the fact that they chose another college means it’s a
better fit for them. That’ll be hard to quantify, but it is a problem I’m eager to run into.
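
The cross-referencing check described above can be sketched as a simple hit rate: for each past
case, see whether the program the person actually chose appears among the engine’s top
recommendations. This is one possible measure, not a settled evaluation method. The `recommend`
stub, the program names, and the four past cases below are hypothetical placeholders standing in
for the real engine and real data.

```python
def hit_rate(cases, recommend, k=5):
    """Fraction of past cases whose actual choice is in the top-k recommendations."""
    if not cases:
        return 0.0
    hits = 0
    for profile, actual_choice in cases:
        if actual_choice in recommend(profile)[:k]:
            hits += 1
    return hits / len(cases)


def stub_recommend(profile):
    """Hypothetical stand-in for the real engine: returns a ranked list of programs."""
    if profile["interest"] == "stem":
        return ["Program A", "Program B", "Program C"]
    return ["Program D", "Program E"]


# Made-up past cases: (profile, program the person actually chose).
past_cases = [
    ({"interest": "stem"}, "Program B"),
    ({"interest": "stem"}, "Program Z"),
    ({"interest": "arts"}, "Program D"),
    ({"interest": "arts"}, "Program A"),
]

rate = hit_rate(past_cases, stub_recommend, k=3)
print(f"Top-3 hit rate: {rate:.2f}")
```

A low hit rate would not automatically mean the engine is wrong, since the chosen college is itself
only a proxy for fit, but it gives a concrete number to track as the inputs change.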

Bibliography
Gaskell, Adi. “Machine Learning And The Future Of Education.” Forbes, 4 Nov. 2016,
www.forbes.com/sites/adigaskell/2016/11/04/machine-learning-and-the-future-of-education/#37908d3f7693.

Vander Ark, Tom. “8 Ways Machine Learning Will Improve Education.” GettingSmart.com, 26
Nov. 2015, www.gettingsmart.com/2015/11/8-ways-machine-learning-will-improve-education/.

Gillispie, Brian. “History of Academic Advising.” History of Academic Advising, 2003,
www.nacada.ksu.edu/Resources/Clearinghouse/View-Articles/History-of-academic-advising.aspx.

Shieber, Jonathan. “Parchment Launches College Recommendation Engine So Applicants Can
Hedge Their Bets.” TechCrunch, 23 Apr. 2014,
techcrunch.com/2014/04/23/parchment-launches-college-recommendation-engine-so-applicants-can-hedge-their-bets/.

Portugal, Ivens, et al. “The Use of Machine Learning Algorithms in Recommender Systems: A
Systematic Review.” Expert Systems with Applications, vol. 97, 2018, pp. 205–227,
doi:10.1016/j.eswa.2017.12.020.

Hofmann, Thomas. “Kernel Methods in Machine Learning.” The Annals of Statistics, vol. 36, no.
3, 1 June 2008, pp. 1171–1220.
