
Try it out
Problem
A lawyer friend of mine was struggling to search through a corpus of emails that were part of the discovery process for a lawsuit. He had a general sense of what he was looking for, but didn't know the exact words to look up. Most search engines match keywords; what he needed was one that searches semantically, based on the meaning of the query.
Solution
I built a semantic search engine, called Searchica, which enables efficient exploration of a corpus of documents through semantic search and visualization. The application shows a list of emails ranked by similarity to the query next to a plot in which each email is a point positioned by semantic meaning. Placing these two semantic representations side by side lets each inform the other, driving meaningful exploration.
Technical Detail
Vector Embedding System: I chose to embed the documents and the query with the msmarco-MiniLM BERT transformer for its light weight, speed, and strong ability to encode semantic meaning. I stored the encodings in a SQLite database with JSON serialization for fast application start-up.
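A minimal sketch of that pipeline, assuming the sentence-transformers library; the exact model checkpoint, table schema, and helper name are illustrative rather than the actual Searchica code:

```python
import json
import sqlite3

from sentence_transformers import SentenceTransformer

# Assumed msmarco-MiniLM variant from sentence-transformers
model = SentenceTransformer("msmarco-MiniLM-L6-cos-v5")

def embed_and_store(doc_id: str, text: str, conn: sqlite3.Connection) -> None:
    """Encode a document and persist its embedding as a JSON array."""
    embedding = model.encode(text).tolist()  # numpy array -> plain list
    conn.execute(
        "INSERT OR REPLACE INTO embeddings (doc_id, vector) VALUES (?, ?)",
        (doc_id, json.dumps(embedding)),
    )
    conn.commit()

conn = sqlite3.connect("searchica.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS embeddings (doc_id TEXT PRIMARY KEY, vector TEXT)"
)
embed_and_store("email-001", "Quarterly earnings look strong.", conn)
```

Serializing each vector as JSON keeps the schema trivial, and loading the whole table at start-up avoids re-encoding the corpus on every launch.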
Search Implementation: When parsing the documents, I identify intra-document components (for an email: the subject, body, and header). At search time, I compare the query embedding against each document field using cosine similarity, then take a weighted sum to get a single similarity score per document. This lets me encode, for example, that the subject line carries more semantic weight than the bcc line.
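A sketch of that scoring, assuming per-field embeddings are already computed; the field weights here are illustrative, not the values Searchica actually uses:

```python
import numpy as np

# Hypothetical weights: the subject matters more than the bcc line
FIELD_WEIGHTS = {"subject": 0.5, "body": 0.35, "header": 0.1, "bcc": 0.05}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_document(query_vec: np.ndarray, field_vecs: dict[str, np.ndarray]) -> float:
    """Weighted sum of per-field cosine similarities -> one score per document."""
    return sum(
        weight * cosine_similarity(query_vec, field_vecs[field])
        for field, weight in FIELD_WEIGHTS.items()
        if field in field_vecs  # skip fields this document doesn't have
    )
```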
Visualization Processing: I used PCA to reduce the latent-space document encodings to 2D points, and a normalized score to color each point. At first I tried the traditional normalization of dividing every score by the maximum score. One issue with this is that even if the corpus has no good matches for the query, the best of the bad matches gets normalized to 1, exactly as if it were truly a good match. Another issue, which I found experimentally, is that only roughly the top 1% of documents by score have any relevance to the query, and a simple linear normalizer doesn't do enough to highlight these top results. I wanted an algorithm that was invariant between queries and that exploded the highest scores, and then I realized the exponential function has exactly these properties, so that's what I used.
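A sketch of the plot preprocessing under those assumptions; the temperature constant is illustrative. Note that exp(t·sᵢ)/exp(t·s_max) depends only on the gap sᵢ − s_max, so it is invariant to query-dependent shifts in the raw scores, while its steepness sharply separates the top-scoring handful of documents:

```python
import numpy as np
from sklearn.decomposition import PCA

def project_2d(embeddings: np.ndarray) -> np.ndarray:
    """Map high-dimensional document embeddings to 2D plot coordinates."""
    return PCA(n_components=2).fit_transform(embeddings)

def exponential_normalize(scores: np.ndarray, temperature: float = 20.0) -> np.ndarray:
    """Color values in (0, 1]: exponentiate, then divide by the max,
    so relative colors depend only on each score's gap from the best one."""
    weights = np.exp(temperature * scores)
    return weights / weights.max()
```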
Web App / Deployment: I used a Flask backend with a React frontend. I containerized with Docker and deployed to Google Cloud Run.
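As a sketch of how the two halves talk, a minimal Flask endpoint the React frontend might call; the route, query parameter, and response shape are assumptions, and the placeholder result stands in for the embedding and scoring pipeline above:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/search")
def search():
    query = request.args.get("q", "")
    # In Searchica this would run the embedding + scoring pipeline;
    # a placeholder result keeps this sketch runnable standalone.
    results = [{"doc_id": "email-001", "score": 0.87, "query": query}]
    return jsonify(results)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)  # Cloud Run routes traffic to the container's port
```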
Architecture: Guided by SOLID object-oriented design principles, I split the backend into single-responsibility components. I made an abstract document class from which specific document types inherit, which makes Searchica extensible to other document types without rewriting any core backend code (admittedly, I got a bit lazy with the polymorphism on the frontend). Careful interface segregation allowed me to test each class independently with Pytest. The frontend uses event-driven design principles to stay responsive to user interaction.
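A sketch of that document hierarchy; the class and field names are illustrative, not the actual Searchica code:

```python
from abc import ABC, abstractmethod

class Document(ABC):
    """Abstract base: every document type exposes named text fields,
    so the search layer can score any document without knowing its concrete type."""

    @abstractmethod
    def fields(self) -> dict[str, str]:
        """Return the embeddable text fields, keyed by field name."""

class Email(Document):
    def __init__(self, subject: str, body: str, header: str):
        self.subject, self.body, self.header = subject, body, header

    def fields(self) -> dict[str, str]:
        return {"subject": self.subject, "body": self.body, "header": self.header}
```

A new document type (say, a PDF report) only needs to subclass Document and implement fields(); the embedding, scoring, and visualization layers stay untouched.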
Source Code