Heart disease detection#
In this tutorial, we will focus on applying federated learning techniques to a classification problem using Scikit-Learn, a popular machine learning library in Python. We will walk you through the process step by step, from setting up the federated learning environment to evaluating the model’s performance.
Scikit-Learn, also known as sklearn, provides a wide range of tools and algorithms for tasks such as data preprocessing, feature selection, model training, and evaluation. It is widely used for classification, regression, clustering, and dimensionality reduction. It offers a user-friendly interface and integrates well with other libraries in the Python ecosystem, making it a go-to choice for many machine learning practitioners and researchers.
%load_ext autoreload
%autoreload 2
Table of contents#
The dataset
Task 1: training plan
Task 2: the experiment
Task 3: model validation
Tutorial#
The dataset #
The Heart Disease dataset available at https://archive.ics.uci.edu/dataset/45/heart+disease is a widely used dataset in the field of cardiovascular research and machine learning. It contains a collection of medical attributes from patients suspected of having heart disease, along with their corresponding diagnosis (presence or absence of heart disease). The dataset includes information such as age, sex, blood pressure, cholesterol levels, and various other clinical measurements.
It was collected in 4 hospitals in the USA, Switzerland and Hungary. This dataset contains tabular information about 740 patients distributed among these four clients.
A federated version of this dataset has been proposed in Flamby. Following their approach, we preprocess the dataset by removing missing values and encoding non-binary categorical variables as dummy variables. We finally obtain the following centers:
Number | Client | Dataset size
---|---|---
0 | Cleveland’s Hospital | 303
1 | Hungarian Hospital | 261
2 | Switzerland Hospital | 46
3 | Long Beach Hospital | 130
For teaching purposes, we decided to merge client 0 with client 2, and client 1 with client 3 (a pandas sketch of this step follows the list). The final federated scenario is therefore the following:
client1, with 349 elements
client2, with 391 elements
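For illustration, here is a minimal pandas sketch of the preprocessing and merging steps described above. The per-center file names are hypothetical placeholders, not actual tutorial assets:
import pandas as pd

# Hypothetical per-center files (placeholder names, one per hospital).
center_files = ['cleveland.csv', 'hungarian.csv', 'switzerland.csv', 'long_beach.csv']
dfs = [pd.read_csv(path, delimiter=';', header=None) for path in center_files]

# Flamby-style preprocessing: drop rows with missing values, then encode
# categorical (object-dtype) columns as dummy variables.
dfs = [pd.get_dummies(df.dropna()) for df in dfs]

# Merge center 0 with center 2, and center 1 with center 3.
client1 = pd.concat([dfs[0], dfs[2]], ignore_index=True)  # 303 + 46 = 349 rows
client2 = pd.concat([dfs[1], dfs[3]], ignore_index=True)  # 261 + 130 = 391 rows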
Task 1: Defining the training plan #
A training plan is a class that defines the four main components of federated model training: the data, the model, the loss, and the optimizer. It is responsible for providing custom methods allowing every node to perform the training.
In the case of scikit-learn, Fed-BioMed already does a lot of the heavy lifting for you by providing the FedPerceptron, FedSGDClassifier and FedSGDRegressor classes as training plans. These classes already take care of the model, optimizer, loss function and related dependencies for you, so you only need to define how the data will be loaded.
In this tutorial we are going to use an SGDClassifier, and therefore the corresponding FedSGDClassifier training plan.
Model arguments#
model_args is a dictionary with the arguments related to the model, which will be passed to the SGDClassifier constructor.
IMPORTANT For classification tasks, you are required to specify the following two fields:
n_features: the number of features in each input sample (in our case, the number of clinical attributes in each patient record)
n_classes: the number of classes in the target data
Other model arguments depend on the specific model you are using and are defined in the model definition. Refer to the model documentation.
Training arguments#
training_args is a dictionary containing the arguments for the training routine (e.g. batch size, learning rate, epochs, etc.). This will be passed to the routine on the node side.
IMPORTANT To set the training arguments, we may either pass them to the Experiment constructor or set them on an instance with the setter method:
exp.set_training_arguments(training_args=training_args)
Setters are also available for individual training arguments, for example:
exp.set_aggregator(aggregator=FedAverage())
TO_DO:
Apply the scaler to your data
Define training args: num_updates, batch_size.
Define model args as explained above.
import pandas as pd

from fedbiomed.common.training_plans import FedSGDRegressor, FedPerceptron, FedSGDClassifier
from fedbiomed.common.data import DataManager
from sklearn.preprocessing import MinMaxScaler

class SkLearnClassifierTrainingPlan(FedSGDClassifier):
    def init_dependencies(self):
        """Define additional dependencies.

        In this case, the training plan only needs the MinMaxScaler used to
        preprocess the data.
        """
        return ["from sklearn.preprocessing import MinMaxScaler"]

    def training_data(self, batch_size):
        """Load the node's local dataset and wrap it in a DataManager."""
        df = pd.read_csv(self.dataset_path, delimiter=';', header=None)
        X = df.iloc[:, :-1]
        y = df.iloc[:, -1]
        self.scaler = MinMaxScaler()
        # X = [...] TODO: apply the scaler to the data.
        return DataManager(dataset=X, target=y, batch_size=batch_size, shuffle=True)
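As a hint for the scaler TODO above, one possible completion of that line (a sketch: fit the scaler on the node's local data, then transform it):
# Inside training_data, in place of the TODO line:
X = self.scaler.fit_transform(X)  # scale each feature to [0, 1] using the local data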
n_features = 18
n_classes = 2
model_args = {'max_iter': 100,
              'tol': 1e-1,
              'loss': 'huber',
              # [...] TODO: Insert the missing model arguments.
              }
training_args = {
    # [...] TODO: Insert the training arguments as elements in the dict.
}
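One possible completion of the two dictionaries (a sketch: n_features and n_classes are fixed by the dataset, while the num_updates and batch_size values are illustrative):
model_args = {'max_iter': 100,
              'tol': 1e-1,
              'loss': 'huber',
              'n_features': n_features,  # 18 features after preprocessing
              'n_classes': n_classes}    # binary target: presence/absence of heart disease

training_args = {'num_updates': 5,   # illustrative value
                 'batch_size': 32}   # illustrative value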
Task 2: the Experiment #
The experiment enables Federated Learning by orchestrating the training process across multiple nodes. It searches for datasets based on specific tags, uploads the training plan file, sends model and training arguments, tracks and checks training progress, and downloads and aggregates model parameters for the next round.
TO_DO:
Define the training plan to use.
Pass the model and training arguments.
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage
tags = ['heart-jupyter-%%%%']
rounds = 10
# search for corresponding datasets across nodes datasets
exp = Experiment(tags=tags,
model_args=None, #TODO: insert the correct value
training_plan_class=None, #TODO: insert the correct value
training_args=None, #TODO: insert the correct value
round_limit=rounds,
aggregator=FedAverage(),
node_selection_strategy=None)
exp.run()
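A possible completion of the constructor above (a sketch, reusing the training plan and the argument dictionaries defined in Task 1):
exp = Experiment(tags=tags,
                 model_args=model_args,
                 training_plan_class=SkLearnClassifierTrainingPlan,
                 training_args=training_args,
                 round_limit=rounds,
                 aggregator=FedAverage(),
                 node_selection_strategy=None)
exp.run()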
Task 3: Model Validation #
During federated training, model validation plays a crucial role in assessing performance without a dedicated holdout dataset. Fed-BioMed enables separate model validation on each node after parameter updates, allowing comparison of model performances. Two types of validation can be performed:
one on globally updated parameters before training a round,
another on locally updated parameters after local training is completed on a node.
This helps users evaluate the impact of node-specific training on model improvement.
Here is the list of validation arguments that can be configured.
test_ratio: Ratio of the validation partition of the dataset. The remaining samples will be used for training. By default, it is 0.0.
test_on_global_updates: Boolean value that indicates whether validation will be applied to globally updated (aggregated) parameters (see Figure 1). Default is False.
test_on_local_updates: Boolean value that indicates whether validation will be applied to locally updated (trained) parameters (see Figure 1). Default is False.
test_metric: One of MetricTypes that indicates which metric will be used for validation. It can be a str or an instance of MetricTypes (e.g. MetricTypes.RECALL or 'RECALL'). If it is None and no testing_step is defined in the training plan (see section: Define Custom Validation Step), the default metric will be ACCURACY.
test_metric_args: A dictionary that contains the arguments that will be used for the metric function.
TO_DO:
Initialize a new experiment.
Use the setters to define the validation arguments.
Launch the training and check the validation performances.
exp = Experiment(
# [...]
training_args=training_args
)
#TODO: set the parameters using the setters. Example: exp.set_test_ratio(test_ratio=0.1)
exp.run()
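One possible configuration of the validation arguments (a sketch: only set_test_ratio appears in this tutorial, and the remaining setter names are assumptions that mirror its naming pattern):
exp.set_test_ratio(test_ratio=0.1)      # hold out 10% of each node's data for validation
exp.set_test_on_global_updates(True)    # assumed setter: validate aggregated parameters
exp.set_test_on_local_updates(True)     # assumed setter: validate locally trained parameters
exp.set_test_metric(metric='F1_SCORE')  # assumed setter and metric name; default is ACCURACY
exp.run()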