GovScape lets you easily search millions of government documents
Benjamin Charles Germain Lee – UW News, Wed, 24 Jun 2026

A search for “redacted documents” on a search engine.
A University of Washington-led research team created GovScape, an efficient search system for PDFs from the End of Term Web Archive. Users can look up exact keywords, like “FAFSA,” or use a visual search option to query for qualities like “redacted documents.” Photo: University of Washington

At the end of every presidential term, the End of Term Web Archive preserves that administration’s web presence as a vast trove of documents and webpages. The archive began in 2008, with George W. Bush’s second term, and runs up to 2024, collecting images, text, graphs, redacted pages and other media. So while it contains important public information, finding that information in the glut can prove difficult.

A University of Washington-led research team created GovScape, an efficient search system for PDFs from the End of Term Web Archive. Users can look up exact keywords, like “FAFSA,” or use a semantic search, which finds documents on a topic even if the exact search terms don’t appear on the page. A visual search option lets them query for qualities like “redacted documents,” “aerial photographs” or “pie charts.” The system can currently search the 10 million PDFs hosted online during Donald Trump’s first term; the team plans to expand it to the whole archive.

Because the researchers used highly efficient artificial intelligence models to read the documents, processing all the PDFs cost less than $1,500, or about $1 per 47,000 pages. By comparison, Google might charge consumers considerably more.
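The two cost figures above can be cross-checked with simple arithmetic. The sketch below is illustrative only: it treats "less than $1,500" as an upper bound and "about $1 per 47,000 pages" as exact, so the derived page count is a rough ceiling and not a number reported by the team. The function names are hypothetical.

```python
# Figures quoted in the article; everything derived from them is an estimate.
TOTAL_COST_USD = 1_500      # "less than $1,500", used here as an upper bound
PAGES_PER_DOLLAR = 47_000   # "about $1 per 47,000 pages"


def implied_pages(total_cost_usd: float, pages_per_dollar: float) -> float:
    """Pages that could be processed for the given budget at the given rate."""
    return total_cost_usd * pages_per_dollar


def cost_per_million_pages(pages_per_dollar: float) -> float:
    """Dollar cost of processing one million pages at the given rate."""
    return 1_000_000 / pages_per_dollar


if __name__ == "__main__":
    pages = implied_pages(TOTAL_COST_USD, PAGES_PER_DOLLAR)
    print(f"Implied upper bound on pages: {pages:,.0f}")
    print(f"Cost per million pages: ${cost_per_million_pages(PAGES_PER_DOLLAR):.2f}")
```

Under these assumptions, the quoted budget corresponds to at most about 70.5 million pages, or roughly $21 per million pages.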

The team will present its research July 5 at the Annual Meeting of the Association for Computational Linguistics in San Diego.

“The End of Term Web Archive is immensely important to historians, journalists and the American public,” said senior author Benjamin Charles Germain Lee, a UW assistant professor in the Information School. “But many of these digital archives are getting so big ([the Internet Archive] just announced its trillionth page archived) that finding information is the real challenge.”

The team worked with PDFs because they are a ubiquitous file format and can contain text, charts and images, a mix that is challenging for existing search systems but makes the documents ideal candidates for GovScape’s multimodal search.

They built a pipeline to process all the documents: It splits each PDF into individual pages, saves the pages as images, then pulls out the text. The researchers used highly efficient AI models to generate “embeddings” for both the text and the image of each page. An embedding is essentially a string of numbers that systematically captures the content of the text or image.

“Just as library classification systems group books on similar topics on the same shelf, these embeddings group similar pages with one another based on their visual and textual content,” Lee said.

Researchers then built different indexing systems for the three kinds of search. The keyword search uses a basic index, similar to a book index, for all the text. If a user types in “FAFSA,” the system finds all the pages the word appears on.
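A book-style index of this kind is commonly implemented as an inverted index, which maps each word to the set of pages it appears on. The following is a minimal sketch of that general technique, not GovScape's actual code: the page identifiers, sample text and function names are all hypothetical, and a production system would add ranking, phrase matching and on-disk storage.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each term to the set of page ids on which it appears.

    `pages` is a dict of page id -> extracted page text.
    """
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in tokenize(text):
            index[term].add(page_id)
    return index


def keyword_search(index, query):
    """Return the sorted ids of pages that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


# Hypothetical sample pages, keyed by "file#page".
PAGES = {
    "doc1.pdf#p1": "How to complete the FAFSA form",
    "doc1.pdf#p2": "Deadlines for federal student aid",
    "doc2.pdf#p1": "FAFSA deadlines and eligibility",
}
INDEX = build_index(PAGES)

if __name__ == "__main__":
    print(keyword_search(INDEX, "FAFSA"))
    print(keyword_search(INDEX, "FAFSA deadlines"))
```

Because the index is built once, each lookup only touches the entries for the query terms and never rescans the page text.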

For semantic and image searches, the system takes the user’s search term and creates an embedding. It then compares this embedding with the indices created from the embeddings of PDF pages and identifies the closest matches, which are returned as search results.
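The comparison step can be illustrated with a small nearest-neighbor search. This sketch makes several assumptions the article does not confirm: it measures closeness with cosine similarity, scans every page instead of using a specialized index, and uses hand-written three-number vectors in place of embeddings produced by an AI model. The page identifiers and function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_pages(query_embedding, page_embeddings, k=2):
    """Return the ids of the k pages whose embeddings are closest to the query."""
    scored = [
        (cosine_similarity(query_embedding, vector), page_id)
        for page_id, vector in page_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [page_id for _, page_id in scored[:k]]


# Toy stand-ins for model-generated page embeddings.
PAGE_EMBEDDINGS = {
    "memo.pdf#p3": [0.9, 0.1, 0.0],
    "report.pdf#p7": [0.1, 0.9, 0.1],
    "survey.pdf#p2": [0.0, 0.2, 0.9],
}

# Toy stand-in for the embedding of a user's search term.
QUERY_EMBEDDING = [0.8, 0.2, 0.1]

if __name__ == "__main__":
    print(nearest_pages(QUERY_EMBEDDING, PAGE_EMBEDDINGS, k=2))
```

A full scan like this is fine for a handful of pages. At the scale of millions of pages, systems typically rely on approximate nearest-neighbor indices to keep searches fast.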

“Our next goal is to cover all of the 70 million PDFs in the entire End of Term Web Archive, everything from 2008 to 2024,” Lee said. “One of the challenges moving forward is how to efficiently search at that scale.”

Because government archives contain “every file type under the sun,” Lee said, future work might expand to documents such as spreadsheets, images and HTML pages.

“I’m really excited about the prospects for better access to government information with projects like GovScape,” Lee said. “Being able to actually find relevant information is vital to the health of democracy and to the functioning of society.”

Co-authors include a researcher of Boston University, who completed this research as a doctoral student in the Paul G. Allen School of Computer Science & Engineering; two researchers who completed this research as UW master’s students in the Information School; several students in the Allen School; and collaborators at Harvard University, the Massachusetts Institute of Technology, the University of North Texas and the American Institute of Physics.

For more information, contact Lee at bcgl@uw.edu.
