Fetal Health Project

subject academy

“Predicting the fetal health by the analysis of the different fetal health conditions to reduce the health risks associated with the newborn babies.”

1 Introduction


Infant mortality, the death of a child before its first birthday, is one of the most painful experiences a human being can go through. Reducing it to the greatest extent possible has long been a central aim of medical science.

With each era, as medical science has progressed by leaps and bounds, one of its main areas of focus has been improving the health of pregnant women and their newborn babies. As humankind's knowledge and skill in the field have advanced, infant mortality and overall mortality have steadily decreased over the centuries. While average life expectancy was a little over 30 years in the pre-modern world, it has more than doubled in the roughly 300 years since the Age of Enlightenment and now stands at about 70 years. This is a remarkable achievement for medical science.

On further research, we find that while overall human mortality has been recorded in detail over the last 300 years, data on infant mortality before the 20th century is not readily available. The 20th century, however, saw medical science advance at a rate arguably higher than at any other time in recorded history. Figure 1 below is a graphical representation of worldwide infant mortality rates recorded between 1990 and 2019 by ourworldindata.org, a non-profit that works with sociological record keeping.


Figure 1 – Infant Mortality Rates vs Year

Figure 1 shows that medical science has done its fair share in helping mankind go through healthier pregnancies and produce healthier children. Over the period covered by Figure 1 alone, worldwide infant mortality has decreased from over 6% to less than 3%.

Now, in the 21st century, reducing child mortality is a key Sustainable Development Goal of the United Nations. The United Nations has set the expectation that by 2030 countries should end preventable deaths of newborns and children under the age of 5, with all countries aiming to reduce under-5 mortality to no more than 25 deaths per 1,000 live births (Frieden et al., 2020).

Alongside massive leaps in electronics and technology, the 21st century has witnessed a previously unimaginable boom in personal and business data. Hand in hand with this increase in data came technologies that turn it into meaningful information. Machine learning is one such field and has yielded strong results, especially in predictive analytics. It is only natural that healthcare now leverages machine learning to further curb infant mortality, which brings us to this project.

The health of fetuses has long been monitored by a test called a cardiotocogram (CTG), in use since the second half of the 20th century, in which fetal heart rate and uterine contractions, among other things, are measured using ultrasound. The test works by sending ultrasound pulses into the uterus and measuring the response, thus revealing uterine contractions, fetal heart rate (FHR), fetal movements and various other factors (Maeda, 2015). Visual analysis of cardiotocograms has well-documented poor reproducibility. Automatic analysis offers a way around this problem, as it generates a quantifiable and consistent report (Edwards, 2007). Computerized systems allow the evaluation of parameters and features that cannot be assessed properly by the human eye, such as short-term variability (Limsukhawat, 2017). They also provide an organized, non-degradable and easily accessible means of transmitting, storing and reviewing tracings. Finally, computerized systems considerably ease the creation of databases, which have important research and clinical applications.

The primary goal of the project is to develop a classification machine learning model which will enable the health care providers to classify fetal health using data points recorded during cardiotocograms (CTG).

The data set used to develop the classification model contains 2126 records of fetuses collected by cardiotocograms (CTGs) performed on pregnant women in their third trimester (Limsukhawat, 2017). This information helps healthcare professionals take appropriate steps to help prevent child and maternal mortality. The data set also includes classifications of fetal health status made by 3 expert obstetricians: Normal, Suspect, and Pathological.

2 Literature Review

In the past decade, extensive work has been done in applying machine learning techniques to study fetal health and to come up with measures to improve the health of unborn babies. Since cardiotocography is the single most commonly performed test on pregnant women to determine fetal health, this paper is not the first to apply machine learning algorithms to study and classify cardiotocograms. Three major areas within fetal healthcare have received a great deal of attention from machine learning developers around the globe, with multiple papers published:

  • Detection and Prevention of Fetal Hypoxia
  • Early detection of Congenital Anomalies
  • Cardiotocography Analysis for determining Fetal Health

Krishna Mohan Mishra from the National College of Ireland published a paper in 2016 called ‘Application of Machine Learning Techniques to classify Fetal Hypoxia’ in which the author closely studies Cardiotocography results, performs feature extraction and then tests multiple machine learning models to classify cases of pregnant women into two classes – Normal (No Hypoxia noticed in Fetuses) and Abnormal (Hypoxia detected in the fetus). Doctors often avoid normal deliveries and perform C-sections in cases where Fetal Hypoxia is noticed. Artificial Neural Networks and Support Vector Machine algorithms managed to secure the highest accuracy in predicting the presence of fetal hypoxia with accuracies of 98.51% and 97.92% achieved respectively in this study.

Congenital anomalies are detected in between 1% and 3% of all child births in the world. About 60–70% of the anomalies can be diagnosed via ultrasonography, while the remaining 30–40% can only be diagnosed after childbirth. In a paper published in Computer Methods and Programs in Biomedicine in 2018, Akhan Akbulut, Egemen Ertugrul and Varol Topcu used a clinical dataset of 96 pregnant women to predict fetal anomaly status from maternal and clinical data. The dataset was obtained through a maternal questionnaire and detailed evaluations by 3 clinicians from the RadyoEmar radiodiagnostics center in Istanbul, Turkey. The highest prediction accuracy reported in the paper was 89.5%, achieved during the development tests with a Decision Forest model.

Zahra Hoodbhoy, Mohammad Noman, Ayesha Shafique, Ali Nasim, Devyani Chowdhury and Babar Hasan, in their paper ‘Use of Machine Learning Algorithms for Prediction of Fetal Risk using Cardiotocographic Data’, published in 2019, discuss how accurately various machine learning techniques can predict the condition of fetal health. In their experiment, they applied 10 different machine learning techniques to the given dataset. On the training data set, the classification models generated by extreme gradient boosting, random forest and decision tree had high precision (>96%) in identifying threatening fetal states from the cardiotocogram readings. Figure 2 below shows the performance achieved by the various machine learning algorithms in their experiment.


Figure 2 – Efficacy of Various Machine Learning Techniques in predicting fetal health

In conclusion, although cardiotocography results have been studied extensively in the last decade and multiple classification algorithms have been applied to them, we will attempt to build on techniques already in use to produce more accurate classification models. In our study, we limit ourselves to Decision Tree, Random Forest and Naïve Bayes techniques for classifying fetal health. The studies mentioned above did not clearly report their feature selection processes and results, so we choose our own techniques to select or reject features.

3 Data Summary

The data set used in this project to develop the classification model was sourced from the University of California Irvine Machine Learning Repository. It contains records of 2126 pregnant women in the third trimester of their pregnancy. The data set consists of 21 attributes used to measure the fetal heart rate (FHR) and uterine contractions (UC) on cardiotocograms. In accordance with the criteria defined by the National Institute of Child Health and Human Development, the assessment of fetal risk includes qualitative and quantitative measurements of the FHR (i.e., baseline heart rate; baseline variability; number of accelerations per second; number of early, late and variable decelerations per second; number of prolonged decelerations per second; and sinusoidal pattern) and of UC (i.e., baseline uterine tone, contraction frequency, duration and strength). The cardiotocograms were classified by three expert obstetricians, and their interpretation is treated as the gold standard. The fetal cardiotocograms were extracted using SisPorto 2.0 (Speculum, Lisbon, Portugal), a software package designed for the automatic analysis of cardiotocograms.

3.1 Target variable

The target variable is the column ‘fetal_health’ in the data set which is a categorical variable with three categories, denoted as 1-Normal, 2-Suspect, and 3-Pathological. This classification was done by expert obstetricians after examining the predictor variables associated with each pregnancy in the training data set. Our aim is to create a machine learning model which accurately predicts this variable on being fed with the predictor variables.

3.2 Predictor variables

The predictor variables measure various metrics collected via cardiotocograms (CTGs) performed on pregnant women in their third trimester. The metrics were obtained by sending ultrasound pulses and measuring their responses, thereby shedding light on fetal movement, fetal heart rate (FHR), uterine contractions and more.

The 21 predictor variables all contain numeric data, each existing in separate scales and ranges, all recorded from the CTGs conducted.

Table 1 below lists the attributes recorded from the cardiotocograms and used in the models:

Variable symbol   Variable description
LB                Fetal heart rate baseline (beats per minute) of fetus
AC                Number of accelerations per second of fetus
FM                Number of fetal movements per second of fetus
UC                Number of uterine contractions per second of fetus
DL                Number of light decelerations per second of fetus
DS                Number of severe decelerations per second of fetus
DP                Number of prolonged decelerations per second of fetus
ASTV              Percentage of time with abnormal short-term variability of fetus
MSTV              Mean value of short-term variability
ALTV              Percentage of time with abnormal long-term variability of fetus
MLTV              Mean value of long-term variability of fetus
Width             Width of FHR histogram of fetus
Min               Minimum of FHR histogram of fetus
Max               Maximum of FHR histogram of fetus
Nmax              Number of histogram peaks of fetus
Nzeros            Number of histogram zeroes of fetus
Mode              Histogram mode of fetus
Median            Histogram median of fetus
Variance          Histogram variance of fetus
Tendency          Histogram tendency of fetus
Fetal Health      Fetal state class code (1=Normal, 2=Suspected, 3=Pathological) of fetus

Table 1 – Set of Predictor Variables and Target Variable

4 Methodology

4.1 Machine Learning Methodology

In order to build a tool that can predict fetal health from data points recorded during cardiotocograms, we use models based on machine learning algorithms. The techniques used in this project are covered below.

4.1.1  Decision Tree

Decision trees are a mainstream data-mining and decision-making technique and a kind of supervised machine learning algorithm. A decision tree is a tree-like diagram that can represent the statistical probability of an event occurring. It is also useful for representing and finding the course of an event, an action, or a result. An example of a decision tree usually makes the concept easier to understand.

The branches in a decision tree diagram show likely outcomes, reactions, or the decisions that can be made at any junction. The last node at the end of a branch displays the outcome or result (Hegelich, 2016).

When used to make predictions on a dataset, the decision tree algorithm uses a metric called entropy to determine which column to split on first. Entropy measures how mixed the classes of the target variable are within a group of observations (Dong and Wang, 2009): it is 0 when every observation belongs to the same class and highest when the classes are equally represented. At each step the tree considers candidate splits on the predictor variables, chooses the one that leaves the resulting groups with the lowest entropy, and then repeats the process within each group.

Once splitting begins, the decision at each junction is made on the basis of a metric called Information Gain. The split that gives the highest Information Gain is the one the algorithm takes. Any split that decreases the entropy of the target variable is said to have Information Gain. Repeatedly reducing the entropy of the groups of observations in this way eventually leads to the final prediction. At any split, Information Gain is calculated by subtracting the weighted sum of the entropies of the split branches from the original entropy (Cao, Ge and Feng, 2014).
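The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the labels are made up, the function names are hypothetical, and entropy is taken here as Shannon entropy in bits over the class labels.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(parent, branches):
    """Parent entropy minus the size-weighted entropy of the split branches."""
    total = len(parent)
    weighted = sum(len(b) / total * entropy(b) for b in branches)
    return entropy(parent) - weighted


# Made-up labels: 1 = Normal, 2 = Suspect, 3 = Pathological.
parent = [1, 1, 1, 1, 2, 2, 3, 3]
left = [1, 1, 1, 1]    # records falling on one side of a candidate split
right = [2, 2, 3, 3]   # records falling on the other side

gain = information_gain(parent, [left, right])
print(f"parent entropy = {entropy(parent):.3f}, information gain = {gain:.3f}")
```

In this toy split the left branch is pure (entropy 0), so most of the parent's uncertainty is removed; a tree would prefer this split over one that leaves both branches mixed.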

A decision tree is a diagrammatic representation of decision making that leverages the statistical concept of probability. These diagrams are known as decision trees because the branches of the flow chart spread out in a pattern that closely resembles the branching structure of a tree (García Márquez, Segovia Ramírez and Pliego Marugán, 2019). Different branches of the tree lead to different outcomes or decisions because of the different starting variables and information gains involved.

It is a very reliable tool for decision making and has been used around the globe for making decisions about complex investing and financing issues, healthcare, education, even issues dealing with the personal lives of people (Bigby et al., 2017).

The decision tree is most useful in situations that involve a series of decisions with a number of outcomes at each step. Its strength is that it divides the decision-making process into chunks, which provides a basis for analytical thinking and makes it easier to arrive at the most suitable decision. Sometimes the variables of the decision depend on each other (McCaffery, Irwig and Bossuyt, 2007).

4.1.2  Random Forest

The Random Forest algorithm was introduced by Leo Breiman in 2001, and the name Random Forests is a trademark of Leo Breiman and Adele Cutler. Random Forest is an ensemble learning method: it builds multiple decision trees and combines their predictions, which generally increases prediction accuracy compared with a single tree (Mangal and Shankar, 2014). For classification, each tree predicts a class and the forest outputs the class chosen by the majority of trees (or, in some implementations, the class with the highest average predicted probability).

Decision trees came to be widely accepted across many industries and became a standard go-to predictive model for a range of classification problems. Random Forests are extremely flexible and can serve the predictive needs of a wide range of datasets with minimal adjustment (Gower, 2013). As a result, Random Forest became one of the dominant classification models and held that status until recently, when Extreme Gradient Boosting algorithms began to outperform it in classification accuracy.

The Random Forest algorithm uses a technique called Bootstrap Aggregation, or Bagging. Suppose we have a training set X = x1, x2, ..., xn with target values Y = y1, y2, ..., yn. During training, Bagging repeatedly draws a random sample of n records from the training set with replacement (so the same record can appear more than once in a sample) and fits a decision tree to each sample. This process is repeated B times, producing B trees.
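The sampling step of Bagging can be sketched as follows. This is only an illustration under stated assumptions: the records and labels are made up, the helper name is hypothetical, B is set to 3 for brevity, and the tree-fitting step itself is left out.

```python
import random


def bootstrap_sample(X, y, rng):
    """Draw len(X) records with replacement; also return the out-of-bag indices."""
    n = len(X)
    idx = [rng.randrange(n) for _ in range(n)]
    chosen = set(idx)
    oob = [i for i in range(n) if i not in chosen]
    return [X[i] for i in idx], [y[i] for i in idx], oob


rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
X = [[120.0], [132.0], [140.0], [128.0], [150.0], [135.0]]  # made-up feature rows
y = [1, 1, 2, 1, 3, 2]                                      # made-up class labels
B = 3  # number of bootstrap samples, one per tree

samples = [bootstrap_sample(X, y, rng) for _ in range(B)]
for b, (Xb, yb, oob) in enumerate(samples):
    print(f"sample {b}: labels {yb}, out-of-bag indices {oob}")
```

Each sample has the same size as the training set but usually omits some records; those omitted records are the out-of-bag records for the tree trained on that sample.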

The bootstrapping procedure improves the performance of the Random Forest model because it decreases the variance of the model without increasing its bias. While individual decision trees are highly sensitive to noise in the training data, the average of the decisions made by many trees is far less affected by that noise, as long as the trees are not too strongly correlated. If the same deterministic tree-building algorithm were applied to the same data set every time, it would produce strongly correlated trees, or even the same tree over and over again. Bootstrapping de-correlates the trees in a Random Forest by feeding them different training sets (Kim, 2019).

Additionally, we can also measure the uncertainty of the prediction made by a Random Forest by measuring the standard deviation of all the predictions made by the different decision trees.
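As a rough illustration of this idea, assume each tree reports a probability that a given record is Pathological. The eight values below are made up; a real forest would supply one value per tree.

```python
from statistics import mean, pstdev

# Made-up per-tree probabilities that one record belongs to class 3 (Pathological).
tree_probabilities = [0.1, 0.0, 0.2, 0.1, 0.0, 0.3, 0.1, 0.0]

forest_estimate = mean(tree_probabilities)  # the forest's averaged prediction
spread = pstdev(tree_probabilities)         # disagreement between the trees

print(f"estimate = {forest_estimate:.2f}, standard deviation = {spread:.2f}")
```

A small standard deviation means the trees largely agree; a large one flags a record on which the forest's prediction is less certain.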

Commonly, a few hundred to a few thousand decision trees are used in a random forest, depending on the size and nature of the training data set. A good way to find the optimal number of decision trees for a particular classification task is to use cross-validation or to observe the out-of-bag error, which is the mean prediction error on each training sample when using only the trees that did not have that sample in their bootstrap sample. The error on both the training and test data sets generally levels off once enough trees have been fit.

4.1.3 Naive Bayes

This classifier is based on Bayes' theorem with a strong (naive) assumption of independence between the predictor variables.

Naive Bayes models have some benefits. They are robust to noise, as the probabilities are calculated from all the data. They also work well in the presence of irrelevant attributes, as these have little impact on the posterior probability calculation. However, because the model assumes independence between the predictor variables, it does not perform well if the attributes are correlated.
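To make the independence assumption concrete, here is a minimal sketch of a Naive Bayes classifier. The text does not say which variant was used in the project; a Gaussian likelihood per feature is assumed here because the predictors are numeric, and the data and function names are made up for illustration.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps every variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model


def predict(model, row):
    """Return the class with the highest log posterior.

    Independence assumption: the per-feature log likelihoods are simply added.
    """
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up, already normalized records for two classes (1 = Normal, 3 = Pathological).
X = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.25],
     [0.80, 0.90], [0.90, 0.80], [0.85, 0.75]]
y = [1, 1, 1, 3, 3, 3]

model = fit_gaussian_nb(X, y)
print(predict(model, [0.12, 0.18]), predict(model, [0.88, 0.82]))
```

Because each feature contributes its own term to the score, two strongly correlated features effectively count the same evidence twice, which is the weakness noted above.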

4.2   Methodology – Methods & Transformations

The dimensionality reduction techniques and variable transformations applied are described in detail below.

4.2.1 Dimensionality reduction

The default set of predictor variables comprises 21 numeric variables which together determine the target variable, i.e. Fetal Health. Basic exploratory data analysis showed no strong linear relationship between any of the predictor variables and the target variable. Below are graphical representations of three predictor variables (Baseline Heart Frequency, Prolonged Decelerations and Percentage of Long-Term Variability) and their relationship to the target variable, Fetal Health. Since no immediate relationship is visible between the predictor variables and the target variable, we perform a Chi Square test to select the predictor variables most strongly associated with the target variable and feed those into our model.

4.2.2 Chi Square Test

The Chi-square test for independence, also referred to as Pearson's chi-square test, is used in a myriad of fields including science, economics and finance. The test assesses whether two categorical variables are independent of each other. A high Chi-square value indicates that the two variables are likely dependent. In other words, the higher the Chi-square value of a feature with respect to the target variable, the better suited the feature is for training the model.

Below in Table 2 is the result of performing the Chi Square test on the 21 predictor variables vs the target variable:

Feature Name      Chi Square Value
baseline value    8.73

Table 2 – Chi Square values from the Chi Square test of Predictor Variables

Based on the results, we select an arbitrary threshold value of 5. All features with a Chi Square value of less than 5 are eliminated from the data set and not fed into the models. Hence the columns ‘fetal_movement’, ‘histogram_max’, ‘histogram_number_of_peaks’ and ‘histogram_number_of_zeroes’ are removed from the dataset. All variables selected for use in the model are highlighted in green.
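The text does not state how the Chi Square values were computed, so the following is only a sketch of the classic contingency-table form of the statistic, followed by the threshold check. The counts are made up, and binning a feature into low/high is an assumption made purely for illustration.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if feature and class were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Made-up counts. Rows: feature binned as low / high.
# Columns: Normal / Suspect / Pathological.
table = [[60, 10, 5],
         [20, 15, 15]]

THRESHOLD = 5  # the cut-off chosen in the text
stat = chi_square(table)
keep_feature = stat >= THRESHOLD
print(f"chi-square = {stat:.3f}, keep feature: {keep_feature}")
```

A table whose rows have identical class proportions gives a statistic of 0, which is why features scoring below the threshold are treated as carrying little information about the target.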

4.3 Normalization

4.3.1 Min-Max by record normalization

Many machine learning algorithms are sensitive to scale and can be thrown off if columns are recorded in different units and span different ranges of numbers. Min-max normalization rescales each value in a column to a value between 0 and 1: the maximum value in the column becomes 1, the minimum becomes 0, and every other value x is mapped to (x - min) / (max - min). We perform min-max normalization on the entire dataset.

For example, suppose we have two columns, height and weight. Height is measured in meters while weight is measured in kilograms. If we fed these columns as they are into a scale-sensitive machine learning algorithm, the column with the larger numeric range could dominate and model performance could suffer. Once min-max normalization is performed, both columns contain numbers between 0 and 1 and can be fed to the model on an equal footing.
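The height and weight example can be worked through in a short sketch. The values are made up, and the helper name is hypothetical.

```python
def min_max_normalize(column):
    """Rescale a list of numbers to the range [0, 1].

    A constant column has no spread, so it is mapped to all zeros.
    """
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


heights_m = [1.50, 1.65, 1.80]    # made-up heights in meters
weights_kg = [50.0, 65.0, 95.0]   # made-up weights in kilograms

heights_scaled = min_max_normalize(heights_m)
weights_scaled = min_max_normalize(weights_kg)
print(heights_scaled, weights_scaled)
```

After scaling, both columns run from 0 to 1 regardless of their original units.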


Finally, with exploratory data analysis, dimension reduction and min-max normalization complete, we are ready to fit the data to machine learning models and measure how accurately the models predict the target variable.

However, before we do that, since the given data set does not come with a separate test set, we first perform a train test split on it. We use the training data to train the models and the test data to measure the accuracy of their predictions.

4.4 Train Test Split

We use the train_test_split function from Python's scikit-learn library to split the 2126 records into two groups. We reserve one third of the data set as test data and use the other two thirds to train the models.
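What such a split does can be sketched with the standard library alone. This is not the project's code: the helper below is hypothetical and the seed is an arbitrary assumption. With scikit-learn, the corresponding call would be along the lines of train_test_split(X, y, test_size=1/3, random_state=42).

```python
import random


def split_indices(n_records, test_fraction, seed):
    """Shuffle the record indices and reserve a fraction of them for testing."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_test = round(n_records * test_fraction)
    return indices[n_test:], indices[:n_test]


# 2126 records, one third held out; the seed 42 is arbitrary.
train_idx, test_idx = split_indices(2126, 1 / 3, seed=42)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split repeatable, so that the accuracy figures of different models are measured on the same held-out records.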

4.5 Methodology – Measurement of Accuracy

Classification problems in machine learning have a range of accuracy metrics that are used to measure the prediction performance of models. Because a vast number of classification problems are binary, the most common way of measuring classification accuracy is the confusion matrix.

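For example, consider a model that is trying to classify whether or not each patient in a group has cancer. In the target column, 1 denotes that the patient has cancer while 0 denotes that the patient is cancer free.

In that case, a confusion matrix would be defined as below in Table 3:

                          Predicted: 1 (cancer)    Predicted: 0 (cancer free)
Actual: 1 (cancer)        True Positive (TP)       False Negative (FN)
Actual: 0 (cancer free)   False Positive (FP)      True Negative (TN)

Table 3 – Confusion Matrix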

A confusion matrix for our classification problem is a little more complicated, since the target column has more than two classes. With three classes, the matrix has one row and one column per class, so instead of the 2 × 2 = 4 boxes of the binary case, our matrix has 3 × 3 = 9 boxes.
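A three-class confusion matrix can be built as in the sketch below. The labels are made up and the function name is hypothetical; rows are taken to be the actual classes and columns the predicted classes.

```python
def confusion_matrix(actual, predicted, classes):
    """Count (actual, predicted) pairs; rows are actual, columns are predicted."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


classes = [1, 2, 3]  # 1 = Normal, 2 = Suspect, 3 = Pathological
actual = [1, 1, 1, 1, 2, 2, 3, 3]      # made-up true classes
predicted = [1, 1, 1, 2, 2, 1, 3, 2]   # made-up model output

cm = confusion_matrix(actual, predicted, classes)
for row in cm:
    print(row)
```

The diagonal holds the correct predictions for each class; every off-diagonal cell shows which class a record was mistaken for.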

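Generally, accuracy for multi-class classification is defined as the fraction of predictions that the model gets right:

$$\text{accuracy} = \frac{1}{N} \sum_{k=1}^{|G|} \sum_{x:\, g(x)=k} I\big(g(x) = \hat{g}(x)\big)$$

where $N$ is the number of observations, $|G|$ is the number of classes, $g(x)$ is the true class of observation $x$, $\hat{g}(x)$ is the predicted class, and $I$ is the indicator function, which returns 1 if the predicted class matches the true class and 0 otherwise.

If we want the model to be more sensitive to any individual class, weights can be assigned to every class such that $\sum_{k=1}^{|G|} w_k = 1$.

The greater the value of $w_k$ for a class, the greater the influence of observations from that class on the score.

The weighted accuracy is measured by:

$$\text{weighted accuracy} = \sum_{k=1}^{|G|} \frac{w_k}{n_k} \sum_{x:\, g(x)=k} I\big(g(x) = \hat{g}(x)\big)$$

where $n_k$ is the number of observations whose true class is $k$, so that each class contributes the fraction of its own observations predicted correctly. With $w_k = n_k / N$ this reduces to the plain accuracy above.

To assign equal weights to each class, we can set:

$$w_k = \frac{1}{|G|}, \quad k = 1, \ldots, |G|.$$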

In our project, we use the accuracy_score function from Python's scikit-learn library. It counts the predictions the model made correctly and divides that count by the number of samples to give the overall accuracy of the model.
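Both measures can be sketched in plain Python. The labels are made up and the function names are hypothetical. The weighted version divides each class's correct count by the size of that class; this is an assumption made here so that the score stays between 0 and 1.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the true class."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def weighted_accuracy(actual, predicted, weights):
    """Sum over classes of weight times the fraction of that class predicted correctly."""
    total = 0.0
    for k, w in weights.items():
        members = [(a, p) for a, p in zip(actual, predicted) if a == k]
        hits = sum(1 for a, p in members if a == p)
        total += w * hits / len(members)
    return total


actual = [1, 1, 1, 1, 2, 2, 3, 3]      # made-up true classes
predicted = [1, 1, 1, 2, 2, 1, 3, 2]   # made-up model output

acc = accuracy(actual, predicted)
equal_weights = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}
balanced = weighted_accuracy(actual, predicted, equal_weights)
print(f"accuracy = {acc:.3f}, equally weighted accuracy = {balanced:.3f}")
```

In this toy example the equally weighted score is lower than the plain accuracy because the two smaller classes are predicted less reliably than the largest one, an effect that is relevant when the classes are imbalanced.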

5 Exploratory Data Analysis


Figure 3 below shows the distribution of the three classes of the target variable ‘fetal_health’ in the given data set, i.e. 1-Normal, 2-Suspect, and 3-Pathological:


Figure 3 – Distribution of different Classes of the Target Variable in the data set

6 Model Application and Summary of Results


We now fit three machine learning models to the training data set and verify the results of the trained models on the test data set. Below are the confusion matrices of the results achieved by each model, followed by the specificity and sensitivity for each class in the classification problem.

  1. Decision Tree
Accuracy for Decision Tree = 0.9145
  2. Random Forest
Accuracy for Random Forest = 0.9387
  3. Naïve Bayes
Accuracy for Naïve Bayes = 0.7378

A graphical representation of the comparative accuracy achieved by the machine learning models is below in Figure 4:


Figure 4 – Accuracy comparison of Machine Learning Algorithms

7 Discussion


On analysis of the results of this project, we can see that the random forest, an ensemble method, outperforms the other classification algorithms used, with an accuracy of 0.9387. Even without an ensemble, the decision tree on its own performs well (0.9145), while Naïve Bayes lags behind (0.7378). This helps explain why random forests and decision trees have long been go-to algorithms for machine learning practitioners tackling classification problems of this kind.


8 Conclusion

In conclusion, we find that the results found in this experiment are congruent with the results published in the ‘Use of Machine Learning Algorithms for Prediction of Fetal Risk using Cardiotocographic Data’, in 2019. Random Forest and Decision Tree models perform very well with the data set with accuracy scores of greater than 90%, while Naïve Bayes doesn’t perform as well with an accuracy score of less than 75%.

Future work would entail experimenting with different feature selection techniques to choose different sets of predictor variables to feed into the models. It would also be interesting to apply gradient boosting algorithms and see how they perform on this multiclass classification problem.



Bigby, C., Douglas, J., Carney, T., Then, S., Wiesel, I. and Smith, E., 2017. Delivering decision making support to people with cognitive disability – What has been learned from pilot programs in Australia from 2010 to 2015. Australian Journal of Social Issues, 52(3), pp.222-240.

Cao, Z., Ge, Y. and Feng, J., 2014. Fast target detection method for high-resolution SAR images based on variance weighted information entropy. EURASIP Journal on Advances in Signal Processing, 2014(1).

Dong, G. and Wang, X., 2009. Application of decision tree construction algorithm based on decision classify-entropy. Journal of Computer Applications, 29(11), pp.3103-3106.

Edwards, J., 2007. Crocs health and safety fears have not been proven. Nursing Standard, 22(2), pp.33-33.

Frieden, T., Cobb, L., Leidig, R., Mehta, S. and Kass, D., 2020. Reducing Premature Mortality from Cardiovascular and Other Non-Communicable Diseases by One Third: Achieving Sustainable Development Goal Indicator 3.4.1. Global Heart, 15(1), p.50.

García Márquez, F., Segovia Ramírez, I. and Pliego Marugán, A., 2019. Decision Making using Logical Decision Tree and Binary Decision Diagrams: A Real Case Study of Wind Turbine Manufacturing. Energies, 12(9), p.1753.

Gower, J., 2013. Practice nursing is the way to go for masters of many trades. Nursing Standard, 28(7), pp.35-35.

Hegelich, S., 2016. Decision Trees and Random Forests: Machine Learning Techniques to Classify Rare Events. European Policy Analysis, 2(1).

Hoodbhoy, Z., Mohammed, N., Aslam, N., Fatima, U., Ashiqali, S., Rizvi, A., Pascua, C., Chowdhury, D. and Hasan, B., 2019. Is the child at risk? Cardiovascular remodelling in children born to diabetic mothers. Cardiology in the Young, 29(4), pp.467-474.

Kim, J., 2019. Optimally adjusted last cluster for prediction based on balancing the bias and variance by bootstrapping. PLOS ONE, 14(11), p.e0223529.

Limsukhawat, P., 2017. Quality of Life during Third Trimester of Pregnant Women with Normal Pre-Pregnant Weight and Obese Pre-Pregnant Women by Asia-Specific BMI Criteria. Journal of Gynecology and Womens Health, 2(3).

Maeda, K., 2015. Fetal Heart Rate Changes are the Fetal Brain Response to Fetal Movement in Actoardiogram: The Loss of Fhr Variability is the Sign of Fetal Brain Damage. Journal of Pregnancy and Child Health, 03(01).

Mangal, S. and Shankar, A., 2014. Prediction Improvement using Optimal Scaling on Random Forest Models for Highly Categorical Data. International Journal of Computer Applications, 108(3), pp.40-43.

McCaffery, K., Irwig, L. and Bossuyt, P., 2007. Patient Decision Aids to Support Clinical Decision Making: Evaluating the Decision or the Outcomes of the Decision. Medical Decision Making, 27(5), pp.619-625.
