Heart disease is a serious threat to human health. By analyzing how features from physical test data correlate with heart disease, you can build models to help predict and prevent it.
Prerequisites
- You have created a workspace. For more information, see Create and manage workspaces.
- You have associated MaxCompute resources with your workspace. For more information, see Create and manage workspaces.
Data mining procedure
Heart disease prediction
- Go to the Machine Learning Designer page.
  - Log on to the PAI console.
  - In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
  - In the left-side navigation pane of the workspace, go to the Designer page.
- Build the pipeline.
On the Designer page, click the Preset Templates tab.
In the Heart Disease Prediction section of the template list, click Create.
In the Create Pipeline dialog box, configure the parameters. You can use the default settings.
The Data Storage parameter specifies the OSS bucket path for storing temporary data and models generated during the pipeline run.
Click Confirm.
The pipeline is created in about 10 seconds.
In the pipeline list, find the Heart Disease Prediction pipeline and click Open.
Designer automatically builds the pipeline from the preset template, as shown in the following figure.

Area
Description
①
Data preprocessing involves denoising data, filling in missing values, and transforming data types. Because each patient is either healthy or has heart disease, heart disease prediction is a binary classification problem. This pipeline's input data includes 13 feature columns and one target column (status). For more information about the fields, see Appendix: Heart disease dataset. During data preprocessing, string fields are converted to numeric types based on their meaning:
Binary data: For a binary field such as sex, which has the values female and male, map female to 0 and male to 1.
Multi-value data: For a field such as cp (chest pain type), map each type to a distinct integer, such as 0, 1, and 2.
The following code is an example of a SQL script for data preprocessing.
select age,
       (case sex when 'male' then 1 else 0 end) as sex,
       (case cp when 'angina' then 0 when 'notang' then 1 else 2 end) as cp,
       trestbps,
       chol,
       (case fbs when 'true' then 1 else 0 end) as fbs,
       (case restecg when 'norm' then 0 when 'abn' then 1 else 2 end) as restecg,
       thalach,
       (case exang when 'true' then 1 else 0 end) as exang,
       oldpeak,
       (case slop when 'up' then 0 when 'flat' then 1 else 2 end) as slop,
       ca,
       (case thal when 'norm' then 0 when 'fix' then 1 else 2 end) as thal,
       (case status when 'sick' then 1 else 0 end) as ifHealth
from ${t1};
②
Feature engineering includes deriving new features and scaling existing ones. This pipeline first uses the Type Conversion component to convert input features to the DOUBLE type, as required by the logistic regression model. Then, the Feature Select Runner component evaluates the impact of each feature on the outcome, using information entropy and the Gini coefficient. Additionally, the Normalize component scales the numeric range of each feature to [0, 1] to eliminate the influence of different units of measurement. The formula used is
result = (val - min) / (max - min).
③
Model training and prediction:
The Split component divides the dataset into a training set and a prediction set at a 7:3 ratio.
The Binary Logistic Regression component trains the model.
Note: If you need to export the model as a PMML file, select the Generate PMML checkbox on the Field Settings tab of this component. Then, click a blank area of the canvas and configure the pipeline data storage path on the Pipeline Attributes tab.
The Predicted component uses the model and the prediction set as input to generate results.
④
The Confusion Matrix and Evaluate components evaluate the model.
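The normalization formula and the 7:3 split described in the table above can be illustrated outside Designer. The following Python sketch is not how the Normalize or Split components are implemented. It only reproduces the arithmetic, and the function names, the sample trestbps readings, and the fixed random seed are assumptions made for illustration.

```python
import random


def min_max_normalize(values):
    """Scale numbers to [0, 1] using (val - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; return zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def split_rows(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows, then cut them into a training set and a prediction set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical resting blood pressure readings (trestbps), for illustration only.
trestbps = [94.0, 120.0, 130.0, 200.0]
scaled = min_max_normalize(trestbps)

# Ten placeholder row IDs split at a 7:3 ratio.
rows = list(range(10))
train_set, prediction_set = split_rows(rows)

print(scaled)
print(len(train_set), len(prediction_set))
```

The smallest reading maps to 0 and the largest to 1, so features measured in different units end up on the same scale before training.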
Run the pipeline and view the output.
Click the run button at the top of the canvas.
After the pipeline finishes running, right-click the Binary Logistic Regression component on the canvas and use the shortcut menu to export the trained model.
Right-click the Predicted component on the canvas and use the shortcut menu to view the prediction results.
View the model performance.
Right-click the Evaluate component on the canvas and click Visual Analysis.
In the Evaluate dialog box, click the Indicator Data tab to view the evaluation metrics.
An AUC value over 0.9 indicates excellent predictive performance.
Right-click the Confusion Matrix component on the canvas and click Visual Analysis.
In the Confusion Matrix dialog box, click the Summary tab to view statistics, such as model accuracy.
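To make the reported metrics easier to interpret, the following Python sketch computes a confusion matrix, accuracy, and AUC by hand. It is an illustration under stated assumptions, not the implementation of the Evaluate or Confusion Matrix components: the labels, scores, and 0.5 decision threshold are hypothetical, and AUC is computed as the fraction of positive and negative pairs in which the positive sample receives the higher score.

```python
def confusion_counts(labels, predictions):
    """Return (tp, fp, tn, fn) for binary labels where 1 means sick."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, predictions):
        if y == 1 and p == 1:
            tp += 1
        elif y == 0 and p == 1:
            fp += 1
        elif y == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical true labels and predicted probabilities of being sick.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

tp, fp, tn, fn = confusion_counts(labels, predicted)
accuracy = (tp + tn) / len(labels)
auc_value = auc(labels, scores)
print(tp, fp, tn, fn, accuracy, auc_value)
```

Accuracy depends on the chosen threshold, whereas AUC depends only on how the scores rank sick patients relative to healthy ones.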
Appendix: Heart disease dataset
This pipeline uses an open-source dataset from UCI containing physical test data from 303 patients at a facility in the United States. The fields are described below.
Field name | Type | Description |
age | STRING | The age of the patient. |
sex | STRING | The gender of the patient. Valid values: female or male. |
cp | STRING | The chest pain types, ordered from most to least painful, are typical, atypical, non-anginal, and asymptomatic. |
trestbps | STRING | Resting blood pressure. |
chol | STRING | Serum cholesterol. |
fbs | STRING | Fasting blood sugar. If the blood sugar level is greater than 120 mg/dl, the value is true; otherwise, the value is false. |
restecg | STRING | Resting electrocardiographic results. The preprocessing script uses values such as norm (normal) and abn (abnormal). |
thalach | STRING | Maximum heart rate achieved. |
exang | STRING | Exercise-induced angina. true indicates presence; false indicates absence. |
oldpeak | STRING | ST depression induced by exercise relative to rest. |
slop | STRING | The slope of the ST segment of an electrocardiogram (ECG), which can be down, flat, or up. |
ca | STRING | Number of major vessels colored by fluoroscopy. |
thal | STRING | The issue types, in ascending order of severity, are norm, fix, and rev. |
status | STRING | Indicates whether the patient has heart disease. buff indicates that the patient is healthy and sick indicates that the patient has heart disease. |
This dataset is built into the pipeline created from the template. To download the dataset or learn more about it, see Heart Disease Data Set.
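For readers who want to explore the downloaded dataset outside MaxCompute, the following Python sketch mirrors the CASE expressions of the preprocessing SQL script for a single record. It is a minimal illustration under stated assumptions: the encode_row function and the sample record are hypothetical, and the numeric fields are converted to float in the same step, which the pipeline instead handles later with the Type Conversion component.

```python
def encode_row(row):
    """Convert one raw record of STRING fields into numeric values."""

    def pick(value, mapping, default):
        # Same behavior as "case ... when ... else default end" in the SQL script.
        return mapping.get(value, default)

    return {
        "age": float(row["age"]),
        "sex": 1 if row["sex"] == "male" else 0,
        "cp": pick(row["cp"], {"angina": 0, "notang": 1}, 2),
        "trestbps": float(row["trestbps"]),
        "chol": float(row["chol"]),
        "fbs": 1 if row["fbs"] == "true" else 0,
        "restecg": pick(row["restecg"], {"norm": 0, "abn": 1}, 2),
        "thalach": float(row["thalach"]),
        "exang": 1 if row["exang"] == "true" else 0,
        "oldpeak": float(row["oldpeak"]),
        "slop": pick(row["slop"], {"up": 0, "flat": 1}, 2),
        "ca": float(row["ca"]),
        "thal": pick(row["thal"], {"norm": 0, "fix": 1}, 2),
        # As in the SQL script, 1 means sick and 0 means healthy.
        "ifHealth": 1 if row["status"] == "sick" else 0,
    }


# Hypothetical raw record, for illustration only.
raw = {
    "age": "63", "sex": "male", "cp": "angina", "trestbps": "145",
    "chol": "233", "fbs": "true", "restecg": "hyp", "thalach": "150",
    "exang": "false", "oldpeak": "2.3", "slop": "down", "ca": "0",
    "thal": "fix", "status": "buff",
}
encoded = encode_row(raw)
print(encoded)
```

Any value that is not listed explicitly falls into the default branch, exactly as the else clause does in the SQL script, so unexpected strings are silently mapped to the last category.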
Next steps
If the pipeline results meet your expectations, you can deploy the model as an online service for real-time inference. For more information about deployment, see Deploy a model as an online service and PMML-based model deployment.