Building Full-Text Search for BibDesk PDFs
Here’s how most of my teaching prep days went last semester: “I know I’ve read papers using propensity score matching, but I have no idea which ones,” or “surely people have put histograms in published papers,” or “can I find an example of how authors describe calipers when they’re doing matching?”
I love my BibDesk citation management setup, which has been going strong for 10 (!) years now. Almost every paper I’ve read since starting grad school is in there, and it’s easy to search for authors, titles, and my inconsistently applied tags. BibDesk also makes it easy to link your downloaded PDF of the article to its entry in the manager, so you can pull up the PDF with your annotations.
However, there’s no way to do a full-text search over the linked PDFs. When you’re prepping methods classes and trying to find examples of specific methods, those details are rarely going to be in the abstract or tags. (I definitely don’t have a “uses a histogram” tag.) I can’t search the PDFs using the OS/Spotlight, because the PDFs are scattered across different folders: classes I’ve taken, classes I’ve taught, or project-specific folders.
Last semester, I hacked together some code (with LLM assistance) to build full-text search for the PDFs linked in my BibDesk file. That code is on GitHub, but it is very much provided “as-is”.
The code does a few things: it parses the .bib file to find PDF paths, extracts text from all PDFs, then searches that text with regex. I use PyMuPDF to render snippets of the PDFs with highlighted matches and serve it all up in a Streamlit interface.
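To make the search step concrete, here is a minimal sketch of regex matching over already-extracted text. It is not the code from the repo: the `Hit` class, the `search_index` function, and the layout of `index` (cite key mapped to a list of page strings) are all assumptions for illustration, and the sample text is made up.

```python
import re
from dataclasses import dataclass


@dataclass
class Hit:
    cite_key: str  # BibTeX key of the matching entry
    page: int      # 1-based page number
    snippet: str   # matched text with some surrounding context


def search_index(index, pattern, context=40):
    """Search pre-extracted page text with a case-insensitive regex.

    `index` maps a cite key to a list of page strings. This layout is
    hypothetical; the real tool may store its text differently.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for cite_key, pages in index.items():
        for page_number, text in enumerate(pages, start=1):
            for match in regex.finditer(text):
                start = max(match.start() - context, 0)
                end = min(match.end() + context, len(text))
                # Collapse whitespace so each snippet reads as one line.
                snippet = " ".join(text[start:end].split())
                hits.append(Hit(cite_key, page_number, snippet))
    return hits


# Made-up example text, standing in for text extracted from two PDFs.
index = {
    "smith2019": ["We match on the propensity score.", "A caliper of 0.2 SD was used."],
    "lee2021": ["Figure 2 shows a histogram of wait times."],
}
hits = search_index(index, r"calipers?")
```

Keeping the page number with each hit is what lets a later step jump to the right page, and because the pattern is a regex, something like `calipers?` catches both singular and plural forms.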
The interface has a few options on the side, allowing you to enter the path to your .bib file, build the initial text index, or update it as you add new citations. The side panel also has options for removing keyword matches in the references section. In the main panel, you can enter your search term and get all the articles with matching results. Since this is all drawing on the .bib file, we have the complete citation information for each article.
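One simple way to drop matches from the references section is to cut each document's text at the last line that looks like a references heading. The sketch below is a heuristic under that assumption, not necessarily how the repo does it; `REFERENCE_HEADING` and `strip_references` are hypothetical names, and the sample text is invented.

```python
import re

# Heading words that commonly open a reference list (an assumed heuristic).
REFERENCE_HEADING = re.compile(
    r"^\s*(references|bibliography|works cited|literature cited)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_references(text):
    """Return `text` up to the last line that looks like a references heading."""
    matches = list(REFERENCE_HEADING.finditer(text))
    if not matches:
        return text  # No heading found, so keep everything.
    return text[: matches[-1].start()]


# Made-up document text for illustration.
paper = "Methods\nWe use calipers.\nReferences\nDoe (2020). A made-up citation about calipers."
body = strip_references(paper)
```

Using the last heading rather than the first guards against a table of contents that lists “References” early in the document. The trade-off is that an appendix placed after the reference list gets cut too.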
Expanding the results shows where in the document the term was mentioned. This ends up being really helpful for quickly determining if the term is just mentioned in passing or is irrelevant. The highlighting isn’t 100% perfect, and it’s definitely the slowest part of the process: it finds the match on the page, renders the PDF, and converts it to an image. These images get cached, and the cache can get pretty big, so there’s a button to reset the cache.
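The find, render, and cache steps might look roughly like the sketch below. It assumes a recent PyMuPDF that is importable as `pymupdf` (older releases expose the same module as `fitz`) and a simple on-disk cache. `CACHE_DIR`, `cache_path`, `render_highlight`, and `reset_cache` are hypothetical names, not the repo's. Note that PyMuPDF's `search_for` does literal, case-insensitive matching rather than regex, so this version highlights a plain term.

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("highlight_cache")  # hypothetical cache location


def cache_path(pdf_path, page_number, term, cache_dir=CACHE_DIR):
    """Build a stable file name for one (PDF, page, term) combination."""
    key = f"{pdf_path}|{page_number}|{term.lower()}"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return Path(cache_dir) / f"{digest}.png"


def render_highlight(pdf_path, page_number, term, dpi=100, cache_dir=CACHE_DIR):
    """Return PNG bytes of one page with every hit for `term` highlighted."""
    target = cache_path(pdf_path, page_number, term, cache_dir)
    if target.exists():
        return target.read_bytes()  # Cache hit: skip the slow render.

    import pymupdf  # PyMuPDF, imported lazily since only rendering needs it

    with pymupdf.open(pdf_path) as doc:
        page = doc[page_number - 1]         # PyMuPDF pages are 0-based
        for rect in page.search_for(term):  # one rectangle per hit
            page.add_highlight_annot(rect)
        png = page.get_pixmap(dpi=dpi).tobytes("png")

    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(png)
    return png


def reset_cache(cache_dir=CACHE_DIR):
    """Delete every cached image, as a reset button might."""
    for image in Path(cache_dir).glob("*.png"):
        image.unlink()
```

Hashing the PDF path, page, and term into the file name means a repeated search reads the image straight from disk. A lower `dpi` keeps each cached image smaller at the cost of sharpness.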
Clicking the “Open PDF” button will open your original PDF, including any markup you might have on the PDF.
Text search works surprisingly well at finding specific kinds of data visualization.
This isn’t anything fancy: no LLMs, semantic search, or boolean operators. The highlighting doesn’t handle line breaks well, and there’s no OCR support. But it was a good reminder to me of how useful simple keyword matching can be.
This tool has been really useful as I prep lectures. One of the most important aspects of teaching methods is providing examples from the literature, but one of the slowest parts of prep is searching through papers for those examples. Now instead of spending 15 minutes hunting for “a paper that talks about calipers,” I can pull up a few good examples in a minute.