GovScape lets you easily search millions of government documents

Stefan Milne — Wed, 24 Jun 2026 16:00:56 +0000

A University of Washington-led research team created GovScape, an efficient search system for PDFs from the End of Term Web Archive. Users can look up exact keywords, like “FAFSA,” or use a visual search option to query for qualities like “redacted documents.” Photo: University of Washington

At the end of every presidential term, the End of Term Web Archive preserves that administration’s web presence as a vast trove of documents and webpages. The archive began in 2008, with George W. Bush’s second term, and runs up to 2024, collecting images, text, graphs, redacted pages and other media. So while it contains important public information, finding that information in the glut can prove difficult.

A University of Washington-led research team created GovScape, an efficient search system for PDFs from the End of Term Web Archive. Users can look up exact keywords, like “FAFSA,” or use a semantic search, which finds documents on a topic even if the exact search terms don’t appear on the page. A visual search option lets them query for qualities like “redacted documents,” “aerial photographs” or “pie charts.” The system can currently search the 10 million PDFs hosted online during Donald Trump’s first term; the team plans to expand it to the whole archive.

Because researchers used highly efficient artificial intelligence models to read the documents, processing all the PDFs costs less than $1,500, or about $1 per 47,000 pages. By comparison, Google might charge consumers .

The team will present its work July 5 at the Annual Meeting of the Association for Computational Linguistics in San Diego.

“The End of Term Web Archive is immensely important to historians, journalists and the American public,” said senior author Benjamin Charles Germain Lee, a UW assistant professor in the Information School. “But many of these digital archives are getting so big — just announced its trillionth page archived — that finding information is the real challenge.”

The team worked with PDFs because they are a ubiquitous file format and can contain text, charts and images, a mix that is challenging for existing search systems but makes the documents ideal candidates for GovScape’s multimodal search.

To process the documents, they built a pipeline that splits each PDF into individual pages, saves the pages as images, then pulls out the text. The researchers used highly efficient AI models to generate “embeddings” for both the text and the image of each page. An embedding is essentially a string of numbers that systematically captures the content of the text or image.
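The per-page pipeline described above can be sketched roughly as follows. This is a minimal illustration, not GovScape’s actual code: the `PageRecord` type, the `process_document` function and both “toy” embedders are hypothetical names invented here, and the embedders are simple stand-ins for the AI models the team used. The sketch also assumes each page has already been rendered to an image and had its text extracted.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


@dataclass
class PageRecord:
    """One PDF page with the two embeddings used for search (hypothetical type)."""
    doc_id: str
    page_number: int  # 1-based position of the page within its PDF
    text: str
    text_embedding: Vector
    image_embedding: Vector


def toy_text_embedding(text: str) -> Vector:
    """Stand-in for a text model: shares of letters, digits and whitespace."""
    total = max(len(text), 1)
    return [
        sum(c.isalpha() for c in text) / total,
        sum(c.isdigit() for c in text) / total,
        sum(c.isspace() for c in text) / total,
    ]


def toy_image_embedding(pixels: bytes) -> Vector:
    """Stand-in for an image model: mean brightness and share of dark pixels."""
    total = max(len(pixels), 1)
    return [sum(pixels) / (255 * total), sum(p < 64 for p in pixels) / total]


def process_document(
    doc_id: str,
    pages: Sequence[Tuple[bytes, str]],
    embed_text: Callable[[str], Vector] = toy_text_embedding,
    embed_image: Callable[[bytes], Vector] = toy_image_embedding,
) -> List[PageRecord]:
    """Turn (page image, page text) pairs into one record per page."""
    records = []
    for number, (image, text) in enumerate(pages, start=1):
        records.append(
            PageRecord(doc_id, number, text, embed_text(text), embed_image(image))
        )
    return records
```

Keeping one record per page, rather than per document, is what lets a search point to the exact page where a chart or keyword appears.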


“Just as library classification systems group books on similar topics on the same shelf, these embeddings group similar pages with one another based on their visual and textual content,” Lee said.

Researchers then built different indexing systems for the three kinds of search. The keyword search uses a basic index, similar to a book index, for all the text. If a user types in “FAFSA,” the system finds all the pages the word appears on.
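A book-style keyword index of this kind is often implemented as an inverted index, which maps each word to the set of pages containing it. The sketch below shows the idea under simplifying assumptions: the function names and the `(document, page)` identifiers are hypothetical, and the article does not say how GovScape tokenizes text or stores its index.

```python
import re
from collections import defaultdict


def build_keyword_index(pages):
    """Map each lowercased word to the set of page ids where it appears.

    `pages` is a dict of page id -> page text (an assumed input shape).
    """
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_id)
    return index


def keyword_search(index, term):
    """Return the sorted page ids whose text contains the exact term."""
    return sorted(index.get(term.lower(), set()))
```

Because the index is built once ahead of time, a lookup for a term like “FAFSA” does not have to rescan any page text.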

For semantic and image searches, the system takes the user’s search term and creates an embedding. It then compares this embedding with the indices created from the embeddings of PDF pages and identifies the closest matches, which are returned as search results.
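The comparison step can be illustrated with a small nearest-neighbor search. The article does not say which similarity measure or index structure GovScape uses, so this sketch assumes cosine similarity and a brute-force scan over every page; systems at this scale typically rely on approximate nearest-neighbor indexes instead. The function names and sample page ids are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_pages(query_embedding, page_embeddings, top_k=3):
    """Return the ids of the top_k pages most similar to the query embedding.

    `page_embeddings` is a dict of page id -> embedding vector.
    """
    scored = [
        (cosine_similarity(query_embedding, vector), page_id)
        for page_id, vector in page_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [page_id for _, page_id in scored[:top_k]]
```

In a real system the query embedding would come from the same model that embedded the pages, so that a phrase like “pie charts” lands near pages that actually contain pie charts.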

“Our next goal is to cover all of the 70 million PDFs in the entire End of Term Web Archive, everything from 2008 to 2024,” Lee said. “One of the challenges moving forward is how to efficiently search at that scale.”

Because government archives contain “every file type under the sun,” Lee said, future work might expand to documents such as spreadsheets, images and HTML pages.

“I’m really excited about the prospects for better access to government information with projects like GovScape,” Lee said. “Being able to actually find relevant information is vital to the health of democracy and to the functioning of society.”

Co-authors include a Boston University researcher who completed this research as a doctoral student in the Paul G. Allen School of Computer Science & Engineering; researchers who completed this research as UW master’s students in the Information School; students in the Allen School; and researchers at Harvard University, the Massachusetts Institute of Technology, the University of North Texas and the American Institute of Physics.

For more information, contact Lee at bcgl@uw.edu.

Benjamin Charles Germain Lee – UW News
