Improving Breast Cancer Classification with Python and Visualizations

Breast cancer is a significant health concern worldwide, and early detection plays a crucial role in saving lives. Machine learning can assist in this process by predicting whether a tumour is malignant or benign from measurements of its cells. In this blog post, we'll walk through a Python script that trains a Gaussian Naive Bayes classifier with scikit-learn and then visualizes its predictions with Matplotlib and Seaborn.

Step 1: Loading the Data

We begin by importing the necessary libraries and loading scikit-learn's built-in copy of the Breast Cancer Wisconsin (Diagnostic) dataset:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix

# Load dataset
data = load_breast_cancer()

Step 2: Data Preparation

Next, we pull the features, the labels, and their names out of the loaded dataset. In this dataset, label 0 means malignant and label 1 means benign:

# Organize our data
label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']
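
Before modelling, it can help to confirm what was actually loaded. The snippet below is a minimal sketch that repeats the loading step so it runs on its own; the name class_counts is illustrative and not part of the original script.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
features = data['data']
labels = data['target']
label_names = data['target_names']

# Number of samples and number of features per sample
print(features.shape)

# Count how many samples carry each label (index 0, then index 1)
class_counts = np.bincount(labels)
for name, count in zip(label_names, class_counts):
    print(f'{name}: {count}')
```

scikit-learn's copy of the dataset holds 569 samples with 30 features each, of which 212 are malignant and 357 are benign, so the classes are moderately imbalanced.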

Step 3: Splitting the Data

To assess how well the model handles data it has not seen, we hold out 33% of the samples as a test set. Fixing random_state makes the split reproducible:

# Split our data
train, test, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=42)
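
Because the two classes are not equally common, a purely random split can leave the training and test sets with slightly different class proportions. One possible variation, sketched here as an assumption rather than as part of the original script, is to pass stratify=labels so that both subsets keep roughly the same proportions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

features, labels = load_breast_cancer(return_X_y=True)

# stratify=labels is an addition here; the original split does not use it
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.33, random_state=42, stratify=labels
)

print(train.shape, test.shape)

# Labels are 0 (malignant) and 1 (benign), so the mean is the benign fraction
print(f'Train benign fraction: {train_labels.mean():.3f}')
print(f'Test benign fraction:  {test_labels.mean():.3f}')
```

The two printed fractions should be nearly identical. Without stratify they can drift apart, which matters more as datasets get smaller or more imbalanced.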

Step 4: Building and Training the Model

We create a Gaussian Naive Bayes classifier and train it with the training data:

# Initialize our classifier
gnb = GaussianNB()

# Train our classifier (fit() updates gnb in place and returns it)
gnb.fit(train, train_labels)
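
Training a Gaussian Naive Bayes model amounts to estimating, for each class, a prior probability plus a mean and variance for every feature. As a rough illustration, the fitted classifier exposes some of these estimates through its class_prior_ and theta_ attributes. The sketch below repeats the earlier steps so that it runs on its own:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.33, random_state=42
)

gnb = GaussianNB()
gnb.fit(train, train_labels)

# class_prior_: fraction of training samples belonging to each class
for name, prior in zip(data['target_names'], gnb.class_prior_):
    print(f'{name}: prior = {prior:.3f}')

# theta_: per-class mean of every feature, shape (n_classes, n_features)
print(gnb.theta_.shape)
first_feature = data['feature_names'][0]
print(f'{first_feature}: class means = {gnb.theta_[:, 0]}')
```

Comparing the per-class means of a feature gives a quick feel for how well that feature separates malignant from benign samples.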

Step 5: Making Predictions

We make predictions on the test data and evaluate the model's accuracy:

# Make predictions
preds = gnb.predict(test)

# Evaluate accuracy
accuracy = accuracy_score(test_labels, preds)
print(f'Accuracy: {accuracy:.2f}')
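
Accuracy alone treats every mistake the same, whereas in a screening setting a missed malignant tumour is usually the more serious error. As a hedged extension of the script, per-class precision and recall can be reported with scikit-learn's classification_report and recall_score. The name malignant_recall is illustrative, and pos_label=0 is needed because malignant is label 0 in this dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.33, random_state=42
)

gnb = GaussianNB()
gnb.fit(train, train_labels)
preds = gnb.predict(test)

# Precision, recall and F1 score for each class
print(classification_report(test_labels, preds, target_names=data['target_names']))

# Recall for the malignant class: share of malignant tumours the model caught
malignant_recall = recall_score(test_labels, preds, pos_label=0)
print(f'Malignant recall: {malignant_recall:.2f}')
```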

Step 6: Visualizing Results

To gain deeper insights, we create a confusion matrix and visualize it using Matplotlib and Seaborn:

# Create a confusion matrix
cm = confusion_matrix(test_labels, preds)

# Visualize the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=label_names, yticklabels=label_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

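The confusion matrix breaks the test-set predictions down by class. Rows are the actual labels and columns are the predicted labels, so the diagonal cells count correct predictions and the off-diagonal cells count errors. Because label 0 is malignant and label 1 is benign in this dataset, the top-right cell counts malignant tumours that were predicted as benign, which is typically the costlier mistake, while the bottom-left cell counts benign tumours that were flagged as malignant.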
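
The same numbers can also be read directly from the array that confusion_matrix returns. The following is a small sketch, not part of the original script, and the four variable names are illustrative. It relies on scikit-learn's convention that rows are actual labels and columns are predicted labels, in sorted label order (0 = malignant, 1 = benign):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.33, random_state=42
)

gnb = GaussianNB()
gnb.fit(train, train_labels)
preds = gnb.predict(test)

cm = confusion_matrix(test_labels, preds)

# Row 0: actual malignant samples; row 1: actual benign samples
malignant_caught = cm[0, 0]   # malignant, predicted malignant
malignant_missed = cm[0, 1]   # malignant, predicted benign
benign_flagged = cm[1, 0]     # benign, predicted malignant
benign_cleared = cm[1, 1]     # benign, predicted benign

print(f'Malignant caught: {malignant_caught}, missed: {malignant_missed}')
print(f'Benign cleared: {benign_cleared}, flagged: {benign_flagged}')
```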

Conclusion

In this blog post, we've demonstrated how to build a breast cancer classification model using Python and scikit-learn. Visualizing the results shows not only how often the model is right, but also which kinds of mistakes it makes when separating malignant from benign tumours.

Early detection of breast cancer is critical, and machine learning models, coupled with visualizations, can assist medical professionals in making informed decisions. This script is a starting point for more advanced analyses and for developing predictive models in healthcare.

Remember that while machine learning is a powerful tool, it should be used in conjunction with medical expertise to make accurate and responsible decisions regarding patient health.