To ask a model to provide sources or cite documents, users can include explicit instructions in their prompts, such as "Cite the sources you used" or "Include references to specific documents." However, the model’s ability to comply depends on its architecture. Models without direct access to external data (like standard LLMs) generate responses from patterns in their training data, not real-time document retrieval. They may invent plausible-looking citations to non-existent papers, books, or URLs, or attribute information inaccurately to real sources. For models augmented with retrieval systems (e.g., those connected to databases or search engines), citations can be more reliable when they reference the retrieved content directly, though accuracy still depends on the quality of the retrieval step.
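As a minimal sketch of the two setups just described, the Python snippet below builds a plain citation-instructed prompt and a retrieval-grounded one. The `retrieved_passages` format (dicts with `title` and `text` keys) is an assumption made for this example, not a required schema, and how the prompt is actually sent to a model is left out because it depends on the client in use.

```python
# Minimal sketch contrasting instruction-only and retrieval-grounded prompting.
# The retrieved_passages format (dicts with "title" and "text" keys) is an
# assumption for this example, not a required schema.

def build_plain_prompt(question: str) -> str:
    # Instruction-only prompting: the model answers from training data,
    # so any citations it produces may be fabricated and must be checked.
    return (
        f"{question}\n\n"
        "Cite the sources you used, including author, year, and a URL or DOI."
    )

def build_grounded_prompt(question: str, retrieved_passages: list[dict]) -> str:
    # Retrieval-grounded prompting: citations are constrained to the
    # supplied documents, so they can be verified against a known set.
    context = "\n\n".join(
        f"[{i + 1}] {p['title']}\n{p['text']}"
        for i, p in enumerate(retrieved_passages)
    )
    return (
        "Answer using only the numbered documents below and cite them "
        "inline as [1], [2], etc. If the documents do not contain the "
        "answer, say so.\n\n"
        f"Documents:\n{context}\n\nQuestion: {question}"
    )
```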
A key challenge in evaluating citations is verifying their authenticity. For example, a model might generate a citation like "Smith et al. (2020) found..." that does not correspond to any real study. Validating such a claim requires cross-checking against databases like PubMed or Google Scholar, which is time-intensive. Even when a cited source exists, the model might misrepresent its content, for instance citing a real paper but oversimplifying its conclusions. Models also tend to paraphrase, which makes it harder to trace claims back to exact passages. Citations to paywalled articles or proprietary documents further complicate verification, since reviewers may lack access, and formatting inconsistencies (e.g., malformed DOI links) hinder validation as well.
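Part of this cross-checking can be scripted. As one illustration (the paragraph above names PubMed and Google Scholar; Crossref is used here simply because it exposes a public REST API for registered DOIs), the sketch below checks whether a cited DOI exists and whether its registered title roughly matches the title the model attributed to it. The 0.6 word-overlap threshold is an arbitrary choice for the example.

```python
# One scriptable check: does a cited DOI exist in Crossref, and does its
# registered title resemble the title the model attributed to it? Crossref
# only covers registered scholarly works, and the 0.6 word-overlap threshold
# below is an arbitrary choice for this sketch.
import requests

def check_doi(doi: str, claimed_title: str) -> dict:
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if resp.status_code != 200:
        # Unregistered DOI (or a transient API error) -- treat as unverified.
        return {"doi": doi, "exists": False, "title_match": False}
    titles = resp.json().get("message", {}).get("title", [])
    registered = titles[0] if titles else ""
    # Crude similarity: fraction of claimed-title words found in the record.
    claimed = set(claimed_title.lower().split())
    overlap = len(claimed & set(registered.lower().split())) / max(len(claimed), 1)
    return {
        "doi": doi,
        "exists": True,
        "registered_title": registered,
        "title_match": overlap > 0.6,
    }

# Usage: check_doi("10.xxxx/placeholder-doi", "Title the model claimed")
```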
Another issue is the model’s tendency to "hallucinate" credible-sounding references. For example, it might invent a URL like "https://research.org/study123" that resembles a real domain but leads nowhere. Without built-in mechanisms to fact-check citations during generation, users must manually verify each reference, which isn’t scalable. Solutions like integrating retrieval-augmented generation (RAG) systems or grounding responses in predefined document sets can improve accuracy, but these require infrastructure and curation. Ultimately, evaluating citations demands a combination of automated checks (e.g., validating URLs) and human oversight, which adds complexity to deploying reliable citation workflows.
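For the automated-check side (validating URLs), a rough sketch might look like the following: extract URLs from a model response and flag any that fail to resolve. A successful HTTP status only shows that the page exists, not that it supports the claim, so the citations that do resolve still need human review.

```python
# Sketch of the URL-validation half of the workflow: pull URLs out of a
# model response and flag any that do not resolve. A 2xx/3xx status only
# means the page exists, not that it supports the claim.
import re
import requests

URL_PATTERN = re.compile(r'https?://[^\s<>")\]]+')

def flag_dead_citations(model_output: str) -> list[dict]:
    results = []
    for url in URL_PATTERN.findall(model_output):
        try:
            resp = requests.head(url, timeout=5, allow_redirects=True)
            resolves = resp.status_code < 400
        except requests.RequestException:
            resolves = False
        results.append({"url": url, "resolves": resolves})
    return results

# e.g. flag_dead_citations("See https://research.org/study123 for details.")
```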