Knowledge distillation is a technique where a smaller, simpler model (the "student") is trained to mimic the behavior of a larger, more complex model (the "teacher"). The idea is to transfer the knowledge learned by the teacher model to the student model, enabling the student to achieve similar performance while being more efficient.
This is typically done by having the student model learn not just from the ground-truth labels, but also from the teacher's soft predictions (the probability distribution over possible classes, usually softened with a temperature parameter). These soft targets carry information the hard labels do not, such as which incorrect classes the teacher considers plausible for a given input, which gives the student a richer training signal.
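As a minimal sketch of how this combined objective can look in practice, the snippet below implements a standard distillation loss in PyTorch: a KL-divergence term between the temperature-softened teacher and student distributions, blended with ordinary cross-entropy against the ground-truth labels. The `temperature` and `alpha` values here are illustrative defaults, not prescribed settings.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.7):
    """Blend the soft-target (teacher) loss with the hard-label loss.

    temperature and alpha are illustrative hyperparameters; typical
    values are tuned per task.
    """
    # Soften both distributions with the temperature
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)

    # KL divergence between teacher and student soft predictions,
    # scaled by T^2 to keep gradient magnitudes comparable
    soft_loss = F.kl_div(soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2

    # Standard cross-entropy against the ground-truth labels
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss
```

During training, the teacher's logits are computed in inference mode (no gradients), and only the student's parameters are updated with this loss.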
Knowledge distillation is commonly used to deploy models in resource-constrained environments (like mobile devices) where smaller models are required but high performance is still desired.