TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used in information retrieval (IR) to evaluate the importance of a term in a document relative to a collection of documents. It combines two components: term frequency (TF) and inverse document frequency (IDF).
TF counts how often a term appears in a given document, while IDF measures how rare the term is across the collection; a common formulation is IDF = log(N / df), where N is the total number of documents and df is the number of documents containing the term. TF-IDF is the product of the two: TF-IDF = TF * IDF. A term that appears frequently in one document but rarely across the corpus receives a high TF-IDF weight, marking it as distinctive for that document, while a term that appears in every document has an IDF of zero, so ubiquitous words such as "the" contribute almost nothing.
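The definitions above can be sketched directly in a few lines of Python. This is a minimal illustration using raw term counts for TF and the unsmoothed log(N / df) form of IDF; production libraries typically apply smoothing and normalization on top of this.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for each term in each tokenized document.

    TF is the raw count of a term in a document; IDF = log(N / df),
    where N is the number of documents and df is the number of
    documents that contain the term.
    """
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({term: count * math.log(n / df[term])
                        for term, count in tf.items()})
    return weights

docs = [
    ["neural", "network", "training", "neural"],
    ["database", "query", "optimization"],
    ["neural", "database", "index"],
]
print(tf_idf(docs))
```

Note how "network", which occurs in only one of the three documents, gets the full IDF of log(3), whereas "neural", which occurs in two documents, is discounted to log(3/2).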
For example, if the term "neural network" appears frequently in a document but rarely in the overall corpus, the TF-IDF value for "neural network" will be high, signaling its relevance to the document. TF-IDF is widely used for ranking search results, text classification, and document clustering, as it helps identify the most significant terms in a document.
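To show the search-ranking use case mentioned above, here is a self-contained sketch that scores documents against a query by taking the dot product of their TF-IDF vectors. The tokenization, scoring scheme, and example corpus are illustrative assumptions, not a standard API; real search engines add normalization (e.g. cosine similarity) and smoothing.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # TF-IDF weight per term per document, with IDF = log(N / df).
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vecs = [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]
    return vecs, df, n

def rank(query, docs):
    """Return document indices sorted by TF-IDF relevance to the query."""
    vecs, df, n = tfidf_vectors(docs)
    # Weight query terms by IDF; ignore terms absent from the corpus.
    q = {t: math.log(n / df[t]) for t in query if df[t]}
    scores = [sum(v.get(t, 0.0) * w for t, w in q.items()) for v in vecs]
    return sorted(range(len(docs)), key=lambda i: -scores[i])

docs = [
    "neural network training with neural embeddings".split(),
    "database query optimization".split(),
    "neural architecture search".split(),
]
print(rank(["neural", "network"], docs))  # → [0, 2, 1]
```

The first document wins because it matches both query terms and repeats "neural"; the third matches only the more common term "neural", and the second matches nothing.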