Zero-shot learning (ZSL) models are evaluated on a handful of standard benchmarks designed to measure how well a model recognizes classes it never saw during training. Commonly used datasets include Animals with Attributes (AwA), Caltech-UCSD Birds (CUB), and attribute Pascal and Yahoo (aPY), which is built on Pascal VOC images. Each dataset pairs its classes with a rich set of attributes or textual descriptions, giving models the side information they need to generalize to unseen classes and making these datasets well suited to ZSL evaluation.
The core evaluation protocol assesses how well the model transfers knowledge from seen to unseen classes. The model is trained on one subset of classes (the seen classes) and then tested on a disjoint set (the unseen classes). Performance is typically reported as classification accuracy on the unseen classes, often averaged per class so that frequently occurring classes do not dominate the score. Developers may also look at metrics such as precision and recall for a more nuanced view of performance across individual classes.
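A minimal sketch of the per-class accuracy computation described above, assuming predictions and ground-truth labels are available as NumPy arrays (the function name and the toy labels are illustrative, not from any particular library):

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, unseen_classes):
    """Top-1 accuracy averaged over unseen classes, so rare classes count equally."""
    accs = []
    for c in unseen_classes:
        mask = (y_true == c)
        if mask.sum() == 0:
            continue  # no test samples for this class
        accs.append((y_pred[mask] == c).mean())
    return float(np.mean(accs))

# Hypothetical example: three unseen classes (5, 6, 7) and a model's predictions
y_true = np.array([5, 5, 6, 6, 7, 7])
y_pred = np.array([5, 6, 6, 6, 7, 5])
print(per_class_accuracy(y_true, y_pred, unseen_classes=[5, 6, 7]))  # ~0.67
```

Averaging per class rather than over all samples is a common convention in ZSL benchmarks, since unseen-class test sets are often imbalanced.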
Another important aspect of evaluation involves semantic embeddings, which represent classes in a way that captures their relationships. Popular choices include attribute vectors or word embeddings such as Word2Vec or GloVe for encoding class names and descriptions. Evaluation then checks how well the model can predict unseen classes from their position relative to the seen classes in this semantic space. Researchers may also conduct ablation studies to measure how removing individual components affects performance. Together, these benchmarks and evaluation methods give a clear picture of how well zero-shot learning models can bridge the gap between known and unknown categories.
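To make the semantic-space prediction step concrete, here is a small sketch of nearest-neighbor classification in an embedding space. It assumes a projection matrix `W` has already been learned on the seen classes and that each class has a fixed embedding (e.g., a GloVe vector); all shapes and the random data are placeholders for illustration:

```python
import numpy as np

def predict_unseen(image_features, W, class_embeddings, unseen_class_ids):
    """Project image features into the semantic space with W, then assign each
    image to the unseen class whose embedding is most similar (cosine)."""
    projected = image_features @ W                      # (n_images, embed_dim)
    candidates = class_embeddings[unseen_class_ids]     # (n_unseen, embed_dim)
    projected = projected / np.linalg.norm(projected, axis=1, keepdims=True)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = projected @ candidates.T                     # cosine similarities
    return np.array(unseen_class_ids)[sims.argmax(axis=1)]

# Toy example with random numbers standing in for real features and embeddings
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 2048))       # e.g. CNN features for 4 test images
W = rng.normal(size=(2048, 300))         # projection learned on seen classes
class_emb = rng.normal(size=(10, 300))   # e.g. 300-d GloVe vectors, one per class
print(predict_unseen(feats, W, class_emb, unseen_class_ids=[7, 8, 9]))
```

The key property being tested is that the learned projection places images near the embeddings of their true classes, even when those classes contributed no training images.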