Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical method used in NLP to represent text by quantifying the importance of each word in a document relative to a corpus. It combines two metrics: Term Frequency (TF) and Inverse Document Frequency (IDF). TF measures how often a word appears in a document, while IDF measures how rare that word is across the corpus, down-weighting terms that appear in many documents. The formula for TF-IDF is:
TF-IDF = TF × IDF, where TF = (Word Count in Document) / (Total Words in Document) and IDF = log(Total Documents / Documents Containing the Word).
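As a concrete illustration, here is a minimal from-scratch sketch of these two formulas in Python; the tiny corpus and the function name are made up for the example:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return a list of {word: tf-idf} dicts, one per document."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each word.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            # TF = count / total words; IDF = log(N / docs containing word)
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "quantum computing uses qubits",
]
for doc_scores in tf_idf(corpus):
    print({word: round(score, 3) for word, score in doc_scores.items()})
```

On this toy corpus, "the" (present in two of three documents) gets a low IDF and therefore a low TF-IDF, while "quantum" (present in only one) scores much higher, which is exactly the behavior described next.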
Stop words like "the" or "and" may have high term frequency, but because they occur in almost every document their IDF is near zero, so their TF-IDF scores are low. Conversely, a rare word that appears often in a particular document receives a high TF-IDF score for that document. TF-IDF is commonly used for text representation in information retrieval, text mining, and search engines. By highlighting a document's distinctive terms, it makes it easier for models to focus on relevant features. Although less expressive than learned embeddings, it remains a practical and interpretable feature-extraction method for smaller datasets and simpler NLP tasks.
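In practice, a common way to use TF-IDF for feature extraction is scikit-learn's TfidfVectorizer, sketched below. Note that scikit-learn applies a smoothed IDF and L2-normalizes each document vector by default, so its scores differ slightly from the raw formula above; the corpus here is the same illustrative one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "quantum computing uses qubits",
]

vectorizer = TfidfVectorizer()          # optionally: stop_words="english"
X = vectorizer.fit_transform(corpus)    # sparse matrix: documents x vocabulary

# Inspect the highest-weighted term in each document.
terms = vectorizer.get_feature_names_out()
for i, row in enumerate(X.toarray()):
    print(f"doc {i}: top term = {terms[row.argmax()]!r}")
```

The resulting sparse document-term matrix can be fed directly into downstream models such as a linear classifier, which is the typical pipeline for the smaller-scale tasks mentioned above.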