Model accuracy in federated learning is evaluated by aggregating performance metrics from multiple client devices or nodes without transferring the raw data. Each client trains a model locally on its own dataset and computes evaluation metrics, such as accuracy or loss, on a held-out subset of its data, often called a validation set. Once local evaluation is complete, the metrics are shared with a central server, which combines them to get an overall picture of model performance across all clients.
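As a rough sketch, client-side evaluation might look like the following. The function name `evaluate_locally`, the metric dictionary, and the scikit-learn-style `predict_proba` model API are all illustrative assumptions, not part of any particular federated learning framework:

```python
import numpy as np

def evaluate_locally(model, X_val, y_val):
    """Client-side evaluation on the local validation split.
    Only the summary metrics (never the raw data) leave the device."""
    probs = model.predict_proba(X_val)   # assumed scikit-learn-style API
    preds = probs.argmax(axis=1)
    accuracy = float((preds == y_val).mean())
    # Cross-entropy loss on the true-class probabilities, clipped for stability.
    true_probs = probs[np.arange(len(y_val)), y_val]
    loss = float(-np.log(np.clip(true_probs, 1e-12, 1.0)).mean())
    return {"accuracy": accuracy, "loss": loss, "num_examples": len(y_val)}
```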
One common approach to aggregating accuracy is to take a weighted average of each client's accuracy, with the weight typically proportional to the size of the client's local dataset. For instance, if one client has a large dataset while another has only a few samples, the large client's accuracy has correspondingly more influence on the global metric. Weighting this way keeps the aggregated figure representative of the model's performance across the full pool of data, rather than letting a tiny client count as much as a large one.
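A minimal server-side sketch of this weighted average, reusing the hypothetical metric dictionaries from above (the `num_examples` key is an assumed convention, though several frameworks report client dataset sizes in a similar way):

```python
def aggregate_accuracy(client_metrics):
    """Server-side, dataset-size-weighted average of client accuracies."""
    total = sum(m["num_examples"] for m in client_metrics)
    return sum(m["accuracy"] * m["num_examples"] for m in client_metrics) / total

# A 10,000-sample client dominates a 100-sample client:
metrics = [
    {"accuracy": 0.91, "num_examples": 10_000},
    {"accuracy": 0.60, "num_examples": 100},
]
print(round(aggregate_accuracy(metrics), 3))  # 0.907, pulled toward the large client
```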
In practice, this might involve tracking additional metrics such as precision, recall, or F1 score, depending on the application's needs. Developers might also implement mechanisms to handle clients with skewed data distributions or outliers, since these can distort the aggregated evaluation. By interpreting these metrics correctly, developers can make informed decisions about adjusting model parameters, selecting clients for the next training round, or applying strategies to address imbalances in the data.
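For richer metrics, each client can compute precision, recall, and F1 locally before reporting. The sketch below uses scikit-learn's metric functions with macro averaging, which weights every class equally; this is one way to keep a skewed local label distribution from masking minority-class failures. The helper name is hypothetical:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

def evaluate_locally_detailed(y_true, y_pred):
    """Per-client metrics beyond plain accuracy. Macro averaging treats
    every class equally, so minority-class errors on a client with a
    skewed label distribution still show up in the reported numbers."""
    return {
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "num_examples": len(y_true),
    }
```

The same size-weighted aggregation shown earlier applies unchanged to any of these per-client values, since each dictionary carries its own `num_examples` weight.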