Enrollment Prediction Models for WVU using Data Mining
Introduction
Institutional Research
- Institutional research assists colleges and universities in campus decision making and planning
- Collects, analyzes, reports, and warehouses quantitative and qualitative data
Problems
- "Build it and they will come" following WW-II
- By 1990, all but a few colleges were in a marketplace with "hypercompetition"
- Unfamiliar ground for institutions used to having more applicants than their capacity[1]
- National average retention rate is close to 55% and in some colleges fewer than 20%[2]
- Roughly 50% of the students entering an engineering program leave before graduation[3]
"Higher education is transitioning from the enrollment
mode to recruitment mode."
Previous Data Mining Applications in Higher Education
- Neural networks for giftedness identification[5]
- Predicting student performance using data mining with educational web-based system[6]
- Determination of factors influencing the achievement of the first year university students using data mining[7]
- Application of GMDH algorithm for modeling of student's quality[8]
- Predicting persistence of students using data mining methods[9]
- Application of data mining methods to the student's dropout problem[10]
"Suffice it to say that higher education is still
a virgin territory for data mining."
Research Objective
- Build models to predict enrollment using the student admissions data
- Evaluate the models using cross-validation, win-loss tables and quartile charts
- Present theories which can be easily understood by the business users
Tools Used
- Weka: an open-source collection of machine learning algorithms for data mining tasks
- MS Access: used to import the flat files into database format, modify and create fields, and convert Access tables to ARFF using VBA
- CRoss Industry Standard Process for Data Mining (CRISP-DM)

(modified from SPSS CRISP-DM picture)
Classifiers
Trees
- Very easy to understand the concept of trees, leaves, and splits
- Results are easy to interpret if the tree is small
- Splits from the root node into leaves
- A few decision tree learners: J48, RandomTree, Random Forest
- Example: Titanic Data
Rules
- Rules, like trees, can be easy to explain
- These rules are in IF-THEN format
- A few rule based learners: OneR, JRip, RIDOR
- Example:
IF FinancialAid = "Yes" AND HighSchoolGPA > 3.00 THEN Persistence = "Yes"
IF FinancialAid = "No" AND HighSchoolGPA < 2.5 AND HoursRegistered < 10 THEN Persistence = "No"
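The two example rules above can be read as an ordered sequence of checks. The following is a minimal sketch in Python; the function name and the choice to return None when no rule fires are assumptions made for illustration, not part of the original example.

```python
def predict_persistence(financial_aid, high_school_gpa, hours_registered):
    """Apply the two example IF-THEN rules in order.

    Returns "Yes" or "No" when a rule fires, and None when neither rule
    covers the student (an assumption; the slide gives no default rule).
    """
    if financial_aid == "Yes" and high_school_gpa > 3.00:
        return "Yes"
    if financial_aid == "No" and high_school_gpa < 2.5 and hours_registered < 10:
        return "No"
    return None


print(predict_persistence("Yes", 3.4, 15))
print(predict_persistence("No", 2.1, 6))
```

A real rule learner such as JRip or RIDOR also emits a default rule, so every instance receives a prediction.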
Bayes
- Very fast learner
- Difficult to explain but works well
- Based on Bayes' theorem
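For reference, Bayes' theorem and the simplification that a Naive Bayes learner makes can be written as follows. The symbols are notation introduced here: c is a class (for example, enrolled or not) and x_1, ..., x_n are the attribute values of one applicant. The second relation uses the standard "naive" assumption that attributes are independent given the class.

```latex
P(c \mid x_1,\dots,x_n) = \frac{P(c)\,P(x_1,\dots,x_n \mid c)}{P(x_1,\dots,x_n)}
\qquad\Rightarrow\qquad
P(c \mid x_1,\dots,x_n) \propto P(c)\prod_{i=1}^{n} P(x_i \mid c)
```

Each factor is estimated from simple frequency counts, which is why the learner is fast.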
Data
Warehouse
- WVU uses SCT Banner system to track students' activities
- A unit in the Office of Information Technology (OIT) runs SQL queries to extract the data into flat files
Extraction
- Admissions data from Spring 1999 to Fall 2006 were used
- There were approximately 3,000 applications for Spring and 25,000 applications for Fall
- 248 attributes covering demographic and academic information
Pre-processing
- All the data tables were joined to create a single table.
- Flag variables were modified - Enrollment indicator, First Generation indicator, Accepted indicator
- ACT and SAT scores were combined using concordance tables
- Permanent-address Zip codes were matched against Zip code and income data from the Census.gov website to create a new field, Median Family Income
- Applications that were not accepted were removed, leaving a total of 112,390 instances
- Domain knowledge and common sense were used to remove some attributes: email address, phone numbers, etc.
- The Access table was converted to ARFF using a VBA script
- String variables were removed using Weka's RemoveType pre-processing filter (attribute type: string)
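Two of the pre-processing steps above, combining ACT and SAT scores and deriving Median Family Income from the Zip code, amount to table lookups. The sketch below illustrates the idea only. Both lookup tables contain placeholder values (they are not the official concordance table or real Census figures), the input field names ACT, SAT and Zip are hypothetical, and preferring the ACT score when both are present is an assumption.

```python
from bisect import bisect_right

# Placeholder (SAT floor, ACT equivalent) pairs, NOT the official concordance.
SAT_TO_ACT = [(400, 11), (1000, 19), (1200, 25), (1400, 31), (1600, 36)]

# Placeholder Zip -> median family income values, NOT real Census data.
ZIP_TO_INCOME = {"00001": 50000, "00002": 45000}


def act_equiv(act, sat):
    """Return an ACT-scale score: the ACT itself, else the SAT mapped via the table."""
    if act is not None:
        return act
    if sat is None:
        return None
    floors = [floor for floor, _ in SAT_TO_ACT]
    i = bisect_right(floors, sat) - 1  # last band whose floor is <= sat
    return SAT_TO_ACT[i][1] if i >= 0 else None


def add_derived_fields(record):
    """Return a copy of the record with ACTEQUIV and MedFamilyIncome added."""
    out = dict(record)
    out["ACTEQUIV"] = act_equiv(record.get("ACT"), record.get("SAT"))
    out["MedFamilyIncome"] = ZIP_TO_INCOME.get(record.get("Zip"))
    return out


print(add_derived_fields({"ACT": None, "SAT": 1250, "Zip": "00001"}))
```

Unknown Zip codes and out-of-range scores map to None here so that missing values stay visible rather than being silently filled.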
Data Visualization
- Enrolled Indicator: 51% of accepted applicants enroll at WVU
- Financial Aid Indicator: 92% of accepted applicants who received some form of financial aid enroll at WVU
- Residency Indicator: 66% of accepted WV residents enroll at WVU and 62% of accepted non-residents do not enroll at WVU
Experiment
Feature Subset Selection (FSS)
- Feature subset selection is a procedure in which the attributes of a dataset are evaluated for their contribution to predictive performance
- The number of attributes can be greatly reduced using FSS, and hence smaller models can be generated
- In this project, two FSS methods were used: Wrapper and InfoGain
- Wrapper was used with the J48 tree learner and the Naive Bayes learner
- Wrapper and InfoGain generated rankings of the attributes in order of importance
- These rankings were used to add attributes to the dataset in ranked order and to evaluate the resulting changes in accuracy on three different learners: J48, Naive Bayes, and RIDOR
- Each learner was cross-validated 10 times
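To make the InfoGain ranking concrete, the sketch below computes information gain (class entropy minus the entropy remaining after splitting on an attribute) for nominal attributes and sorts the attributes by it. It is a standard-library illustration, not Weka's implementation, and the four toy rows are invented purely to exercise the code.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_gain(values, labels):
    """Entropy of the class minus the weighted entropy within each attribute value."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_attributes(rows, class_key):
    """Return (attribute, gain) pairs sorted from most to least informative."""
    labels = [r[class_key] for r in rows]
    attrs = [k for k in rows[0] if k != class_key]
    scores = {a: info_gain([r[a] for r in rows], labels) for a in attrs}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy rows, invented for illustration only.
rows = [
    {"FinancialAidIndicator": "Y", "ResidencyIndicator": "Y", "EnrolledIndicator": "Y"},
    {"FinancialAidIndicator": "Y", "ResidencyIndicator": "N", "EnrolledIndicator": "Y"},
    {"FinancialAidIndicator": "N", "ResidencyIndicator": "Y", "EnrolledIndicator": "N"},
    {"FinancialAidIndicator": "N", "ResidencyIndicator": "N", "EnrolledIndicator": "N"},
]
ranking = rank_attributes(rows, "EnrolledIndicator")
print(ranking)
```

A wrapper method differs in that it scores attribute subsets by actually training the target learner on them, which is slower but tailored to that learner.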
xval
- xval is a script that does the following:
- Randomly divides the data into two parts, training and testing
- Applies specified discretizers to the datasets
- Applies specified learners to given datasets <repeat> number of times
- Datasets were divided equally into training and testing sets
- Number of repeats was 10
- Discretizers used were Nbins and FayyadIrani
- Learners used were JRip, J48, Aode, Bayes and OneR
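The xval script itself is not shown, so the following is only a sketch of the procedure described above: repeat a random 50/50 split ten times and record each learner's accuracy on the test half. The function names, the representation of a learner as a function from training data to a predictor, and the majority-class baseline are assumptions for illustration; the discretization step is omitted.

```python
import random
from collections import Counter


def majority_learner(train):
    """Baseline learner: always predict the most common class in the training half."""
    top = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: top


def xval(data, learners, repeats=10, seed=1):
    """Repeat a random 50/50 train/test split and collect per-repeat accuracies.

    data     : list of (features, label) pairs
    learners : dict mapping a name to a function train -> predict(features)
    """
    rng = random.Random(seed)
    results = {name: [] for name in learners}
    for _ in range(repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        train, test = shuffled[:half], shuffled[half:]
        for name, fit in learners.items():
            predict = fit(train)
            correct = sum(predict(x) == y for x, y in test)
            results[name].append(correct / len(test))
    return results


toy = [((i,), "Y" if i % 3 else "N") for i in range(30)]  # invented data
print(xval(toy, {"majority": majority_learner}))
```

Keeping the per-repeat accuracies, rather than only their mean, is what makes the later t-tests and quartile charts possible.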
Results
FSS
- Accuracy was between 83% and 84% after adding only the first attribute, FinancialAid Indicator
- Two attributes, FinancialAid Indicator and ApplicationStypCode, were selected by Wrapper, and a new dataset was created: Data_WRP_NB_J48
- Seven attributes were selected by InfoGain because J48 produced a smaller tree on them than on the two-attribute dataset, and a new dataset was created: Data_IG
xval
Pivot Table for Dataset created using Wrapper
Pivot Table for Dataset created using InfoGain
- Learner performance differed little between the two datasets
- However, t-tests at the 95% confidence level show that J48 with FayyadIrani is the best learner and OneR with Nbins is the worst; see the win-loss table

- The win-loss table does not tell the whole story, because it does not show the margin of a win or a loss. The quartile chart shows these margins, with the vertical bar marking the median. OneR is skewed toward the negative side, but it differs little from the other learners
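A win-loss table of this kind can be built by testing every pair of learners on their per-repeat accuracies. The sketch below assumes a paired t-test over 10 repeats (the slides only say "t-tests with 95% confidence", so the pairing is an assumption) and hard-codes the two-sided 95% critical value for 9 degrees of freedom. The accuracy lists are invented.

```python
from math import sqrt
from statistics import mean, stdev

T_CRIT_DF9 = 2.262  # two-sided 95% critical value of Student's t, 9 degrees of freedom


def compare(a, b, t_crit=T_CRIT_DF9):
    """Return 'win', 'loss' or 'tie' for learner a versus learner b (paired t-test)."""
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0:  # identical differences: decide on the sign alone
        d = mean(diffs)
        return "tie" if d == 0 else ("win" if d > 0 else "loss")
    t = mean(diffs) / (sd / sqrt(len(diffs)))
    if t > t_crit:
        return "win"
    if t < -t_crit:
        return "loss"
    return "tie"


def win_loss_table(scores):
    """Tally wins, losses and ties of each learner against every other learner."""
    table = {name: {"win": 0, "loss": 0, "tie": 0} for name in scores}
    for a in scores:
        for b in scores:
            if a != b:
                table[a][compare(scores[a], scores[b])] += 1
    return table


# Invented per-repeat accuracies for three hypothetical learner configurations.
acc_a = [0.84, 0.83, 0.85, 0.84, 0.83, 0.84, 0.85, 0.83, 0.84, 0.84]
scores = {"A": acc_a, "B": [x - 0.05 for x in acc_a], "C": list(acc_a)}
print(win_loss_table(scores))
```

The critical value applies only to 10 paired observations; a different number of repeats needs a different value.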

Learnt Theory
- J48 Tree with two attributes
- J48 tree with seven attributes and discretized with Fayyad-Irani discretizer had accuracy of 83.84%
- Only three rules were generated by RIDOR, with an accuracy of 83.06%
FinancialAidIndicator = N
| ApplicationStypCode = 0: N
| ApplicationStypCode = A: N
| ApplicationStypCode = B: Y
| ApplicationStypCode = C: N
| ApplicationStypCode = D: Y
| ApplicationStypCode = E: Y
FinancialAidIndicator = Y: Y
Number of Leaves : 7
Size of the tree : 9
Correctly Classified Instances 93448 83.1462 %
EnrolledIndicator = Y
Except (FinancialAidIndicator = N) and (ApplicationStypCode = A) => EnrolledIndicator = N
Except (FinancialAidIndicator = N) and (ApplicationStypCode = C) => EnrolledIndicator = N
Total number of rules (incl. the default rule): 3
Correctly Classified Instances 93349 83.0581 %
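The two-attribute J48 tree printed above is small enough to transcribe directly into code, which is one way to hand such a theory to business users. The sketch below copies the leaves from the tree output; the function and table names are hypothetical.

```python
# Leaves of the FinancialAidIndicator = N branch, copied from the J48 output above.
NO_AID_LEAVES = {"0": "N", "A": "N", "B": "Y", "C": "N", "D": "Y", "E": "Y"}


def predict_enrolled(financial_aid_indicator, application_styp_code):
    """Predict EnrolledIndicator ('Y' or 'N') using the two-attribute J48 tree."""
    if financial_aid_indicator == "Y":
        return "Y"  # single leaf: applicants with aid are predicted to enroll
    return NO_AID_LEAVES[application_styp_code]  # KeyError for codes not in the tree


print(predict_enrolled("Y", "A"))
print(predict_enrolled("N", "A"))
print(predict_enrolled("N", "B"))
```

The RIDOR rules agree with this tree except for ApplicationStypCode = 0 without aid, where the default rule predicts Y and the tree predicts N.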
Goal-II
HSGPA + Non-Resident + Enrolled = High Earnings with High Quality Students
- Although financial aid helps recruit students, it does not necessarily help retention
"No amount of financial aid seems to cause students to
enroll in more terms, take more credit hours and receive degrees."
- Students with a high high-school GPA are less likely to drop out
- Consider a situation where the institution wants to increase the quality and enrollment of students and at the same time desires to increase the earnings in terms of tuition fee dollars by targeting non-resident students
- Three attributes were combined to form a new response variable
HSGPA --> High school GPA (High > 3.3 OR Low < 3.3)
RESIDENCY INDICATOR --> Resident of State of WV (Yes/No)
ENROLLMENT INDICATOR --> Enrolled in WVU (Yes/No)
- The three attributes have two levels each
- The new response variable will have 2^3 = 8 levels, i.e. 8 classes
- However, only the first four classes of interest were considered; the others were grouped into a fifth class
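Building the combined response variable can be sketched as follows. The slides do not say which four of the eight classes were kept, so the set of classes of interest is passed in as a parameter and the set used in the example is purely hypothetical. The label format and the treatment of a GPA of exactly 3.3 as Low are also assumptions.

```python
def goal2_class(hs_gpa, resident, enrolled, classes_of_interest, threshold=3.3):
    """Combine GPA level, residency and enrollment into one class label.

    Labels outside classes_of_interest collapse into "Other" (the fifth class).
    A GPA exactly equal to the threshold is treated as Low (an assumption).
    """
    gpa = "High" if hs_gpa > threshold else "Low"
    res = "Res" if resident else "NonRes"
    enr = "Enr" if enrolled else "NotEnr"
    label = f"{gpa}_{res}_{enr}"
    return label if label in classes_of_interest else "Other"


# Hypothetical choice of the four classes of interest.
KEEP = {"High_NonRes_Enr", "High_NonRes_NotEnr", "High_Res_Enr", "High_Res_NotEnr"}

print(goal2_class(3.8, resident=False, enrolled=True, classes_of_interest=KEEP))
print(goal2_class(2.9, resident=True, enrolled=True, classes_of_interest=KEEP))
```

With three two-level attributes there are 2^3 = 8 possible labels, so keeping four of them leaves the other four to be merged into "Other".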
Goal-III
HSGPA + Non-Resident + No Fin Aid + Enrolled = High earnings with high quality students who are genuinely interested in WVU.
- Who are the students who have High GPA but did not get financial aid and are still enrolled?
- Could these students help the institution meet its objective of increasing both earnings and student quality?
- Four attributes were combined to form a dependent variable that satisfies the new goals of the institution
HSGPA --> High school GPA (High > 3.3 OR Low < 3.3)
RESIDENCY INDICATOR --> Resident of State of WV (Yes/No)
FINANCIAL AID INDICATOR --> Received Financial Aid(Yes/No)
ENROLLMENT INDICATOR --> Enrolled in WVU (Yes/No)
- With four attributes of two levels each, the new response variable has 2^4 = 16 levels, i.e. 16 classes
- However, the instances of students that received financial aid were ignored for this analysis
Experiment
FSS
- CFS subset evaluation and Chi-squared attribute selection techniques were used
- For business goal-II, the following attributes were selected:
HighSchoolCode
MedFamilyIncome
FinancialAidIndicator
ApplicationMajorCode1
ApplicationStateCode
ScholarsCode
ACTEQUIV
DepVar
- For business goal-III, the following attributes were selected:
HighSchoolCode
MedFamilyIncome
ApplicationMajorCode1
ApplicationStypCode
ApplicationStateCode
ACTEQUIV
DepVar
Resampling
- The classes for the new goals were highly imbalanced
- A re-sampling policy was adopted to address this problem
- To reduce the complexity of the calculations involved and be able to run the learners without crashing, a 50% sample with bias to uniform class was taken
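The slides do not name the tool used for re-sampling, so the following is only a sketch of the idea of a 50% sample fully biased toward a uniform class distribution: every class contributes the same number of instances, drawn with replacement so that rare classes can fill their quota. The function name and parameters are hypothetical.

```python
import random
from collections import Counter, defaultdict


def resample_uniform(rows, class_key, fraction=0.5, seed=1):
    """Draw fraction * len(rows) instances with an equal share for every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[class_key]].append(row)
    per_class = int(len(rows) * fraction) // len(by_class)
    sample = []
    for members in by_class.values():
        # Sampling with replacement lets small classes reach the quota.
        sample.extend(rng.choices(members, k=per_class))
    rng.shuffle(sample)
    return sample


# Invented, highly imbalanced toy data: 90 instances of one class, 10 of another.
rows = [{"DepVar": "Other"}] * 90 + [{"DepVar": "High_NonRes_Enr"}] * 10
sample = resample_uniform(rows, "DepVar")
print(Counter(r["DepVar"] for r in sample))
```

Because minority instances are duplicated, accuracy measured on such a sample is not directly comparable to accuracy on the original class distribution.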
xval
- A 10-way cross-validation experiment on the prepared data with new goal was designed with three learners, viz. Naïve Bayes, J48 tree learner and OneR (rule based learner)
Results
Experiment Results for Business Goal-II
- Naïve Bayes learner produced good accuracy (66.1%) and PD of 94.9% with very low processor and time consumption
- J48 tree based learner achieved highest PD (99.1%) and accuracy of 66.62%
- OneR produced base accuracy of 56.5% with 97.9% PD
- xval results are here
- Fayyad Irani discretization method produced better results as compared with nbins
- Winloss:
- On average, Naïve Bayes with Fayyad-Irani discretization performed better than J48 and OneR
- OneR performed worst among the learners
- Quartile Charts:
- Naïve Bayes and J48 performed equally well on the data
- OneR performed slightly worse
Experiment Results for Business Goal-III
- Even though students who received financial aid were removed, results were very similar to that of business goal-II
- Naïve Bayes learner produced good accuracy (66.1%) and PD of 95% with very low processor and time consumption
- J48 tree based learner achieved highest PD (99.1%) and accuracy of 66.63%
- OneR produced base accuracy of 56.5% with 97.33% PD
- xval results are here
- Winloss:
- On average, Naïve Bayes with Fayyad-Irani discretization performed better than J48 and OneR
- Quartile Charts:
- Naïve Bayes and J48 performed equally well on the data
- OneR performed slightly worse
Conclusions
- Overall, financial aid is the most important factor in attracting students to enroll. Regardless of their high school GPA and ACT/SAT scores, students enroll at WVU if they receive some form of financial aid
- Financial aid can be used as a controlling factor for increasing the quality of incoming students
- This research also found that ACT scores, median family income (in a zip code), college (major) and student type determine enrollment of the students
- Even though accuracy achieved for business goal II and III was not high, it is promising and with more study of attributes and other learners can be used to predict classes of incoming students
- Although the J48 tree learner is simpler to explain than Naive Bayes, for business goals II and III it was very time consuming and generated so many leaves that the tree was difficult to understand. In this case, Naive Bayes is highly recommended as the classifier
"Policy makers may not have confidence in a forecast if they do not
understand its conceptual basis or accept its assumptions"
Future Work
- Attributes such as distance from the campus and first method of contact should be created to see their effect
- Although financial aid is the most significant factor in enrollment, the amount of financial aid offered should also be included in the data, so that "bins" can be created on the amount offered and learners can then be used for classification
- Even though financial aid helps recruit students, it does not necessarily retain them. To find the attributes affecting retention, the enrolled indicator and a "persistence indicator" should be combined. Similar experiments would be necessary to find a relationship between students' demographic and academic information and retention
References
- Klein, T. A., Scott, P. F., Clark, J. L., A Fresh Look at Market Segments in Higher Education, Planning for Higher Education, v30 n1 p5-19, Fall 2001.
- Druzdzel, M. J. and Glymour C., Application of the TETRAD II program to the study of student retention in U.S. colleges, Working notes of the AAAI-94 Workshop on Knowledge Discovery in Databases (KDD-94), Seattle, WA, 1994.
- Scalise, A., Besterfield-Sacre M., et al., First term probation: models for identifying high risk students, 30th Annual Frontiers in Education Conference, Kansas City, MO, USA, Stripes Publishing, 2000.
- Roueche, J., and Roueche, S. High stakes, high performance: Making remedial education work. Washington, DC: Community College Press, 1999.
- Hyuk Kwang, et al., Conceptual Modeling with Neural Network for Giftedness Identification and Education, Lecture Notes in Computer Science, Volume 3611, pp. 560-538, 2005.
- Minaei-Bidgoli, B., et al., Predicting student performance: an application of data mining methods with the educational web-based system LON-CAPA, Proceedings of ASEE/IEEE Frontiers in Education Conference, Boulder, CO: IEEE, 2003.
- Superby, J.F., Vandamme, J-P., Meskens, N., Determination of factors influencing the achievement of the first-year university students using data mining methods, Proceedings of the Workshop on Educational Data Mining at the 8th International Conference on Intelligent Tutoring Systems (ITS 2006), Jhongli, Taiwan, Pages 37-44, 2006.
- Naplava, P. and Snorek N., Modeling of student's quality by means of GMDH algorithms, Modelling and Simulation 2001, 15th European Simulation Multiconference 2001, ESM'2001, Prague, Czech Republic, 2001.
- Luan, J. and Serban, A. M., Data Mining and Its Application in Higher Education, Knowledge Management: Building a Competitive Advantage in Higher Education, New Directions for Institutional Research, Jossey-Bass, 2002.
- Massa, S. and Puliafito P. P., An application of data mining to the problem of the university students' dropout using Markov chains, Principles of Data Mining and Knowledge Discovery, Third European Conference, PKDD'99, Prague, Czech Republic, 1999.
- Sanjeev, A. P. and Zytkow J. M., Discovering enrolment knowledge in university databases, First International Conference on Knowledge Discovery and Data Mining, Montreal, Que., Canada, 1995.
- Brinkman, P. and McIntyre, C., Methods and Techniques of Enrollment Forecasting, Chapter 5 in D. T. Layzell (Ed.), Forecasting and Managing Enrollment and Revenue. New Directions for Institutional Research, (No. 93). San Francisco: Jossey-Bass Inc., Publishers. pp. 67-80, 1997.