Synthesizing scientific literature with retrieval-augmented language models
Key Points:
- OpenScholar is a new retrieval-augmented language model (LM) designed to provide reliable, citation-aware responses to scientific literature queries by retrieving relevant papers, synthesizing findings, and generating responses with inline citations for transparency and verifiability.
- The system addresses three challenges: high-precision retrieval from a massive scientific corpus, accurate synthesis without hallucination, and citation alignment. To do so, it builds OSDS, a data store of 45 million papers, and uses a pipeline that combines a bi-encoder retriever, a cross-encoder reranker, and iterative self-feedback inference to refine outputs.
- OpenScholar generates high-quality synthetic training data with its own inference pipeline, combining final and intermediate outputs, and mixes this data with general and scientific instruction-tuning data to train specialized L
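
The retrieve, rerank, and self-feedback stages named in the key points can be illustrated with a minimal sketch. This is not OpenScholar's implementation: the bag-of-words `embed`, the term-overlap `rerank`, the stub `generate`, and the citation-coverage `critique` are all hypothetical stand-ins chosen so the example runs with the Python standard library alone. A real system would use trained neural encoders and a language model in their place.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    # Hypothetical stand-in for a bi-encoder: a bag-of-words vector.
    # Query and passage are embedded independently of each other.
    return Counter(tokenize(text))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k=3):
    """First stage: rank every passage by embedding similarity, keep top k."""
    q = embed(query)
    return sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]


def rerank(query, candidates, top_n=2):
    """Second stage: score each (query, passage) pair jointly.

    Hypothetical stand-in for a cross-encoder: the fraction of query
    terms that appear in the passage.
    """
    q_terms = set(tokenize(query))

    def joint_score(doc):
        return len(q_terms & set(tokenize(doc))) / len(q_terms) if q_terms else 0.0

    return sorted(candidates, key=joint_score, reverse=True)[:top_n]


def generate(passages, feedback=()):
    # Stub generator (an assumption, not a language model): the first draft
    # cites only passage 1; a revision also cites the passages named in feedback.
    wanted = sorted({1, *feedback}) if passages else []
    return " ".join(f"{passages[i - 1]} [{i}]" for i in wanted)


def critique(draft, passages):
    """Self-feedback stub: list the passage numbers the draft fails to cite."""
    return [i for i in range(1, len(passages) + 1) if f"[{i}]" not in draft]


def answer(query, corpus, max_rounds=3):
    """Retrieve, rerank, draft, then revise until the critique is empty."""
    passages = rerank(query, retrieve(query, corpus))
    draft = generate(passages)
    for _ in range(max_rounds):
        feedback = critique(draft, passages)
        if not feedback:
            break
        draft = generate(passages, feedback)
    return draft, passages
```

The two-stage design reflects a common trade-off: independent embeddings are cheap enough to score a whole corpus, while joint scoring is more expensive and is therefore applied only to the short candidate list. The loop in `answer` stops as soon as the critique returns no feedback, or after `max_rounds` revisions.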