N-grams are contiguous sequences of n items (typically words or characters) extracted from text. For example, in the sentence "I love NLP," the unigrams (1-grams) are ["I", "love", "NLP"], the bigrams (2-grams) are ["I love", "love NLP"], and the trigrams (3-grams) are ["I love NLP"].
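As a concrete illustration, the sketch below extracts these n-grams in Python. The `ngrams` helper and its naive whitespace tokenization are illustrative assumptions, not a standard library API:

```python
# Minimal sketch of word-level n-gram extraction (illustrative helper,
# assuming whitespace tokenization is good enough for the example).
def ngrams(text, n):
    """Return the list of word-level n-grams in `text`."""
    tokens = text.split()  # naive whitespace tokenization
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I love NLP"
print(ngrams(sentence, 1))  # ['I', 'love', 'NLP']
print(ngrams(sentence, 2))  # ['I love', 'love NLP']
print(ngrams(sentence, 3))  # ['I love NLP']
```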
N-grams are widely used in NLP tasks such as language modeling, text generation, and machine translation. They help capture local patterns and dependencies in text. For instance, bigrams in a corpus might reveal common phrase structures like "thank you" or "machine learning." However, n-gram models struggle with long-range dependencies, because an n-gram model conditions each prediction only on the previous n - 1 items, a fixed-length context.
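The toy sketch below counts bigrams in a tiny, made-up corpus and derives the kind of maximum-likelihood probability estimate a bigram language model relies on; the corpus and variable names are illustrative:

```python
# Toy sketch: count bigrams in a small corpus, then estimate a
# maximum-likelihood bigram probability. Corpus is illustrative.
from collections import Counter

corpus = [
    "thank you very much",
    "machine learning is fun",
    "thank you for learning machine learning",
]

bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.lower().split()
    bigram_counts.update(zip(tokens, tokens[1:]))  # adjacent word pairs

# The most frequent bigrams hint at common phrase structures.
print(bigram_counts.most_common(3))
# e.g. [(('thank', 'you'), 2), (('machine', 'learning'), 2), ...]

# Maximum-likelihood estimate P(you | thank) = count(thank you) / count(thank)
unigram_counts = Counter(tok for s in corpus for tok in s.lower().split())
print(bigram_counts[("thank", "you")] / unigram_counts["thank"])  # 1.0
```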
While simple and interpretable, n-grams can lead to sparse representations for large vocabularies, since the number of possible n-grams grows as V^n for a vocabulary of size V (exponentially in n) and most never occur in any given dataset. Modern NLP approaches, like transformers, have largely replaced n-gram-based methods for capturing context. Nonetheless, n-grams remain useful in preprocessing and feature extraction for tasks such as text classification or keyword extraction.
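As one common example of such feature extraction, scikit-learn's CountVectorizer can build unigram-plus-bigram count features via its ngram_range parameter; the documents below are illustrative:

```python
# Sketch of n-gram feature extraction with scikit-learn's CountVectorizer;
# the two documents are illustrative placeholders.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love NLP", "NLP loves n-grams"]
vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the extracted n-gram vocabulary
print(X.toarray())  # one row per document, one column per n-gram count
```

Note that CountVectorizer's default tokenizer drops single-character tokens, so very short words like "I" will not appear in the vocabulary unless you customize the token pattern.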