Hospitalization data are commonly studied to extract information that can improve the health care system. According to the American Hospital Association, over $30 billion was spent on unnecessary hospital admissions in 2006. If patients who are likely to be hospitalized can be identified early, they can receive the necessary treatment sooner and the admission may be avoided. In this context, in 2013, the Heritage Provider Network (HPN) launched the $3 million Heritage Health Prize to develop a system that uses the available patient data (health records and claims) to predict and avoid unnecessary hospitalizations. In this work we take the competition data and try to predict the number of hospitalizations for each patient. The data encompass more than 2,000,000 patient admission records over three years. The aim is to use the data from the first and second years to predict the number of hospitalizations in the third year. To this end, a set of operations is applied, mainly data transformation, outlier detection, clustering, and regression. The data transformation operations are: (1) because the data are too large to be processed at once, they are divided into chunks; (2) missing values are either replaced or removed; (3) because the data are raw and cannot be labeled, different aggregation operations are applied. After the data are transformed, outlier detection, clustering, and regression algorithms are applied to predict the third-year hospitalization number for each patient. The results show that applying regression algorithms directly gives a relative error of 79%. Applying the DBSCAN clustering algorithm followed by the regression algorithm reduces the relative error to 67%, because the attribute generated by the clustering pre-processing step helps the regression algorithm predict the number of hospitalizations more accurately.
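As an illustration of the three data transformation operations listed above (chunking, missing-value handling, aggregation), the following is a minimal sketch rather than the pipeline actually used in this work. The column names `member_id`, `year`, and `length_of_stay`, the helper names `read_chunks` and `aggregate`, the chunk size, and the sample rows are all hypothetical. Replacing a missing length of stay with 0 is only one of several possible replacement policies.

```python
import csv
import io
from collections import defaultdict
from itertools import islice

# Hypothetical claim rows; the real competition files use their own schema.
SAMPLE = """member_id,year,length_of_stay
101,Y1,2
101,Y1,
101,Y2,1
,Y1,3
202,Y1,4
202,Y2,
"""

def read_chunks(handle, chunk_size):
    """Yield lists of at most chunk_size rows so the file is never fully in memory."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk

def aggregate(handle, chunk_size=2):
    """Aggregate raw claims into one record per (patient, year)."""
    totals = defaultdict(lambda: {"claims": 0, "days": 0})
    for chunk in read_chunks(handle, chunk_size):
        for row in chunk:
            if not row["member_id"]:
                continue  # remove rows that cannot be tied to a patient
            days = int(row["length_of_stay"] or 0)  # replace a missing stay with 0
            key = (row["member_id"], row["year"])
            totals[key]["claims"] += 1
            totals[key]["days"] += days
    return dict(totals)

features = aggregate(io.StringIO(SAMPLE))
for key, value in sorted(features.items()):
    print(key, value)
```

Because the running totals are updated chunk by chunk, the result does not depend on the chunk size. Only the aggregated per-patient records need to fit in memory.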
The relative error can be reduced further by applying the clustering pre-processing step twice: the clusters generated in the first clustering step are re-clustered to generate sub-clusters, and the regression algorithm is then applied to these sub-clusters. With this approach the relative error dropped significantly, from 67% to 32%. Patients who share a common hospitalization history are grouped into clusters, and this clustering information is used to enhance the regression results.
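The idea of clustering once, re-clustering each cluster, and then predicting within each group can be sketched as follows. This is a hedged toy example, not the implementation used in this work. The small `dbscan` function is a plain-Python stand-in for a library implementation. The `eps` and `min_pts` values and the toy data are invented for illustration. A per-group mean stands in for the regression algorithm, which the abstract does not name. The relative error is assumed here to be the sum of absolute errors divided by the sum of actual values, and it is computed in-sample.

```python
from collections import defaultdict
from statistics import mean

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point (-1 marks noise)."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_pts:  # only core points expand the cluster
                queue.extend(reach)
    return labels

def two_level_labels(points, eps1, eps2, min_pts):
    """Cluster once, then re-cluster the members of each cluster separately."""
    level1 = dbscan(points, eps1, min_pts)
    members = defaultdict(list)
    for i, label in enumerate(level1):
        members[label].append(i)
    level2 = [None] * len(points)
    for label, idx in members.items():
        sub = dbscan([points[i] for i in idx], eps2, min_pts)
        for i, s in zip(idx, sub):
            level2[i] = (label, s)
    return level1, level2

def group_mean_predictions(groups, targets):
    """Stand-in regressor: predict the mean target of each group."""
    by_group = defaultdict(list)
    for g, t in zip(groups, targets):
        by_group[g].append(t)
    return [mean(by_group[g]) for g in groups]

def relative_error(predicted, actual):
    """Assumed definition: total absolute error over total actual value."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / sum(abs(a) for a in actual)

# Toy patients: (year-1 days, year-2 days) and the year-3 days to predict.
features = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11),
            (13, 13), (13, 14), (14, 13), (14, 14)]
year3 = [0, 0, 1, 1, 9, 10, 10, 11, 13, 14, 14, 15]

level1, level2 = two_level_labels(features, eps1=3.0, eps2=1.5, min_pts=3)
errors = {
    "no clustering": relative_error(group_mean_predictions([0] * len(year3), year3), year3),
    "one clustering pass": relative_error(group_mean_predictions(level1, year3), year3),
    "two clustering passes": relative_error(group_mean_predictions(level2, year3), year3),
}
for name, value in errors.items():
    print(f"{name}: {value:.3f}")
```

On this toy data each additional clustering pass yields more homogeneous groups, so the in-sample error falls at every stage. The figures it prints are unrelated to the 79%, 67%, and 32% reported above.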

