The Archive of 5,000 Lives

Visualize a massive digital archive—the digitized, transcribed diaries of 5,000 World War I soldiers. Each entry is a raw, unedited window into a specific moment: the muddy reality of the trenches, the heartbreak of loss, or a brief, humorous anecdote about rations.
This is where historians traditionally excel. Through meticulous close reading, researchers can spend years embedded in a handful of these diaries, analyzing deep metaphors and subtle linguistic shifts. This analysis is the bedrock of scholarship, providing a profound understanding of individual experiences.
But how do you understand the experience of the entire archive?
We often become so focused on the intricate details of a single “tree” that we struggle to see the broader “forest.” This tension between individual analysis and systemic overview is where distant reading—using computational patterns to scan thousands of texts—becomes invaluable. To navigate that massive digital forest, you need a compass. That compass is topic modeling.
Hands-On Activities
Here is the data you will need for the workshop's hands-on activity:
- Download the zip file and unzip it.
- Open NotebookLM, and create a new notebook.
- Upload the 26 text files, each named for a year in the 1800s. It may take a couple of minutes for NotebookLM to index the files before you can start asking questions.
- Start asking questions. Here are some to get you started:
  - What are the top 10 themes or trends in these documents?
  - What themes in the documents specifically relate to the Indigenous population or individuals?
  - What strategies were suggested to “civilize” Indigenous people?
  - What other questions or themes would you like to explore?
- Create an infographic on a topic that interests you.
- Create an Audio Overview.
Here is a general-purpose NotebookLM workshop with activities that may be helpful if you'd like to explore more of NotebookLM's capabilities.
The Presentation Slides for this workshop
Decoding Topic Modeling: Statistics vs. Semantics
Traditionally, topic modeling is a form of statistical discovery. It is a mathematical tool for uncovering underlying themes in large text collections. For decades, the gold standard has been LDA (Latent Dirichlet Allocation).
LDA operates on the logic of word co-occurrence. If words like “ship,” “sail,” and “port” consistently appear together, the model statistically groups them into a “Maritime” topic. While effective, it relies entirely on vocabulary patterns rather than a true understanding of meaning. It cannot grasp the nuance when “voyage” is used metaphorically.
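Real LDA involves Bayesian inference over latent topic distributions, but the co-occurrence intuition it builds on can be sketched in a few lines of plain Python. This toy example (hypothetical diary snippets, not real data, and not LDA itself) shows how words that travel together across documents start to look like a "topic":

```python
from collections import Counter
from itertools import combinations

# Toy corpus: each "document" is one short diary entry (invented examples).
entries = [
    "the ship left port under full sail",
    "we watched the sail from the port at dawn",
    "mud filled the trench after the shell burst",
    "another shell landed near the trench in the mud",
]

# Count how often each pair of distinct words appears in the same entry.
pair_counts = Counter()
for entry in entries:
    words = sorted(set(entry.split()))
    pair_counts.update(combinations(words, 2))

# Pairs that co-occur in more than one entry hint at a shared theme.
frequent = {pair for pair, n in pair_counts.items() if n > 1}
print(("port", "sail") in frequent)    # True: maritime words travel together
print(("shell", "trench") in frequent) # True: trench-warfare words do too
print(("sail", "shell") in frequent)   # False: the themes stay separate
```

Note that nothing here "understands" sailing or artillery; the grouping emerges purely from which words keep showing up side by side, which is exactly why a metaphorical "voyage" would be misfiled.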
Modern tools like NotebookLM redefine this paradigm. Instead of just counting words, NotebookLM uses Large Language Models (LLMs) to understand meaning and context. It can differentiate between a soldier using “shell” to describe artillery versus a literal seashell. This pivot from pure statistics to deep semantics allows for a far more accurate form of distant reading.
NotebookLM: A Grounded Knowledge Base
NotebookLM’s defining strength is that it grounds the AI in your specific sources.
Consider the distinction: when you prompt a general tool like ChatGPT, it queries the entire internet. This is useful for general knowledge but risky for specialized historical research. Generic models are prone to hallucination, fabricating details that aren’t in your text to generate a plausible-sounding answer.
NotebookLM, conversely, builds a private knowledge base derived exclusively from the documents you provide. It is a closed-loop system. This ensures the “topics” it identifies are actually present in your data, not imported from a generic training set.
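The closed-loop idea can be illustrated with a deliberately simple sketch. This is not how NotebookLM works internally (it uses LLMs over indexed sources); it only demonstrates the grounding contract: answers come from the uploaded documents or not at all. All file names and text are invented for illustration:

```python
# Toy sketch of "grounded" answering: the reply must be drawn from the
# uploaded sources, or the system declines rather than inventing one.
sources = {
    "diary_1916.txt": "The shell fire near the trench lasted all night.",
    "diary_1917.txt": "Rations were short but morale held through winter.",
}

def grounded_lookup(query_word: str) -> str:
    """Return only sentences from the sources that mention the word."""
    hits = [text for text in sources.values() if query_word in text.lower()]
    return " ".join(hits) if hits else "Not found in the provided sources."

print(grounded_lookup("trench"))    # answered verbatim from a source
print(grounded_lookup("airplane"))  # refuses instead of hallucinating
```

The design point is the fallback branch: a grounded system prefers "not found" over a plausible-sounding fabrication, which is the property that makes it safer for specialized historical research.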
Workflow: From Upload to Insight
Using NotebookLM for topic discovery transforms a complex data science task into a streamlined research workflow:
- Upload: You begin by uploading your source files—the PDFs or docs containing the 5,000 diaries.
- The Notebook Guide: Once uploaded, the tool automatically analyzes the corpus and suggests top-level themes. It provides the initial, high-level map of your “forest.”
- The Semantic Query: You can then use the chat to ask complex, contextual questions:
  - “What are the primary recurring themes across these diaries?”
  - “How does the tone of ‘patriotism’ change from 1914 to 1918?”
Note: The free version of NotebookLM limits you to 50 uploaded documents per notebook; the paid version raises this limit to 250.
The Historian as Captain
While NotebookLM is a revolutionary tool, it functions as a co-pilot, not a replacement.
AI lacks genuine historical consciousness. It can identify clusters and themes efficiently, but it often misses irony, sarcasm, or period-specific idioms. It might cluster all mentions of “mud” together, failing to distinguish between physical conditions and metaphorical despair.
The historian remains essential. You must verify the AI’s “topic clusters” against the original text. Use the AI’s findings as a starting point for analysis, not a final conclusion. The most important question a researcher can ask of the output remains: “What did the AI miss?”