MailQuery

written by Neh-Thalggu on 2025-08-12

I admit, I like jQuery. Not jQuery as a full web UI framework, but the notion of a simple, "fluent" interface for accessing something complicated like the DOM, that jQuery introduced me to.

Every now and then I need to do something with another complicated thing and a light-bulb goes off in my head and I think "I wish I had an xQuery for that."

I originally did it when I got bored with crawling around my file-system the old-fashioned way. And wrote https://github.com/interstar/FSQuery

Meanwhile, I've always hated Gmail filters. And increasingly I've been thinking about whether AI might finally help to get my aburdly cluttered Gmail inbox under control.

But then I realised I just wanted a better way to talk to my mailbox from Python. So I've been vibe-coding for a few days, and here's the result, a Python library called "mailquery" : https://github.com/interstar/mailquery

Here's how I can currently manage my Gmail account, in the REPL with a fluent interface and FP attitude.

Import from the library

from mailquery import GmailClient, Mailbox, SQLiteStorage, OR, spit, HTMLPageBuilder, SenderCollector, EmailStatistics

Create the gmail client. The credentials are in "credentials.json". Note the library also has a plain IMAP client

client = GmailClient("credentials.json",allow_delete=True)

Create mailbox from the client. Filter for emails later than 1st August, 2025 that are from either one of hello@restofworld.org or ai.plus@axios.

news = Mailbox(client).after("2025-08-01").from_(OR("hello@restofworld.org","ai.plus@axios.com",))

Note this is a "fluent" interface where you chain filters together with the dot operator. Each filter call adds itself to the Mailbox and returns the Mailbox. There are standard built in ones like from_, to, before, after, subject_contains etc. You can add as many as you like.

OR is a special class that represents any of the arguments. from_(OR("hello@restofworld.org","ai.plus@axios.com")) matches emails from either of these two addresses. The OR can handle as many values as you give it.

There are also the two generic filters : include_when() and exclude_when() These take "predicate" functions. That is, functions which map from an email object to a True/False value

Make sure you understand this concept clearly. "include_when()" means if the predicate returns True, then this filter allows the email through to the next stage "exclude_when()" means that if the predicate returns True, the mail is blocked from passing through to the next stage. But will pass through if the predicate returns False

Anyway, after the last command, "news" now contains a mailbox with the after and from_ filters attached.

But note that nothing has actually run yet. MailQuery is a pretty lazy library.

Let's now just list the mails that match. verbose=False prints just the basic header, verbose=True would print a bunch of extra warnings and diagnostics.

news.list_all(verbose=False)

This has now triggered the query to execute. It ran through the mailbox looking for emails that matched the filters.

Note that all the filtering is done on the client. Which means downloading all the email headers to test them against the criteria BUT as an optimisation for Gmail, both the after() filter and the from_() filter get turned into Gmail server-side filters by our GmailClient class. This makes things MUCH faster, but is completely transparent to the user. You don't need to know anything about Gmail to get this benefit. It's entirely handled by the GmailClient class we used when we created the Mailbox object.

If you aren't using Gmail, or a server where we can use such optmisations, then MailQuery will just chug along, slowly, downloading all the mail and filtering it locally. It's much slower, but it works identically

Remember I said MailQuery was lazy?

At this point the mail headers have been downloaded and cached. If we run list_all() again, we'll just pull them out of the local cache.

However the bodies of the emails are not yet downloaded. And won't be until something calls a get body or get html type function on them. body_contains(), a test whether a string is in the body of the email, is one such filter. If included in the filter chain it will force a download of the email body.

Once bodies are downloaded, the Email objects in the mailbox will cache that too.

Perhaps we want to store the emails that matched our criteria. A Storage object can do that. And we place a store_local() end point on our filter chain.

storage = SQLiteStorage("newsletters.db")
news.store_local(storage)

And that's it, the emails are stored in the database. We can now delete them from the gmail account

news.delete()

But wait, there's more! These newsletters have all their useful stuff in an HTML attachment field. That's a pain. We can extract the text from the html with Beautiful Soup. But we don't really want to hardwire Beautiful Soup into our mail library do we?

We would like a way to optionally process a field in the mail record into some further information.

We also want to store this extra information in the email.

Fortunately, MailQuery has a mechanism for that : add_attribute(att_name,fn)

add_attribute() works like a filter. But it always returns True. It never removes anything from the mailbox.

What it DOES do is transform the email in some way. The function fn needs to be a function that maps emails to some other value. att_name is a string . The add_attribute filter calls fn on each mail passing through, and stores the result in an extra attribute under the name `att_name

For example, we can use Beautiful Soup to extract text from the html like this.

def extract_html_text(email):
    try:
        html_content = email.get_html()
        if html_content:
            soup = BeautifulSoup(html_content, 'html.parser')
            return soup.get_text(separator=' ', strip=True)
        return ""
    except Exception as e:
        print(f"Error extracting HTML text: {e}")
        return ""

Now we can add this as a new attribute

news.add_attribute("html_as_text",extract_html_text).store_local(storage)

Not only does the add_attribute add the extra attribute to all the emails. It also registers the extra attribute name in the Mailbox

Then when we call store_local(storage), storage finds out about the extra attribute name and creates an extra column for it in the SQLite database

But wait there's MORE ....

We can filter. We can "map" mails to other values? Why not have trifecta and add some kind of fold or reduce?

list_all(), delete() and store_local() are three functions we can add at the end of the pipeline to do something with our email

But we also have a generic way to roll them up to get some kind of aggregate value out of them.

Unsurprisingly it's called reduce_all(reducer)

Here's what a Reducer has to look like

class Reducer :
    def init_value(self) : # returns initial value
    def fold(self, next:ParsedEmail)  : # folds the next email into its internal accumulator 
    def final(self) : return the final value of the accumulator

Let's use one to count the emails in our mailbox

class Counter(Reducer):

    def init_value(self):
        self.count = 0

    def fold(self, email: ParsedEmail):
        self.count += 1

    def final(self) -> int:
        return self.count


mails.reduce_all(Counter())

Something more useful is to discover all the unique email addresses in a matching mail set

class SenderCollector(Reducer):
    """Collect all unique sender email addresses"""

    def init_value(self):
        self.senders: Set[str] = set()

    def fold(self, email: ParsedEmail):
        self.senders.add(email.envelope['from'])

    def final(self) -> List[str]:
        return list(self.senders)

mails.reduce_all(SenderCollector())

I hope you see how this thing starts to stack up

Apart from the delete() WHICH IS BLOODY DANGEROUS ... BE VERY CAREFUL WITH IT

Apart from the delete() nothing else changes your mailbox. And everything gets cached locally

So once you've run a mailbox filtering once, whether to print or store, runnning these further reducers will just iterate through mails in memory

MailQuery is powerful for any automated email application. But my goal is to be able to use it in Python REPL to interact with my Gmail dynamically but programatically.

One thing missing from all the automation above is a way to bring the human back into the loop. Sometimes even I can't think of a criterion for deleting or saving emails except by reading the mail and making a personal decision

Never fear, we can bring the human back simply by invoking human()

mails.subject_contains("important").human().delete()

human() works like any other filter on the email chain. Except it turns the terminal into an interactive window where a human user can review and decide the fate of an email. The human can type "D" or SPACE. "D" is obviously intended to signal "delete this email". In fact, because human() is just another filter stage, what it ACTUALLY does is return a True, indicating that the email is to pass through to the next stage of the filter chain. SPACE on the other hand returns False, removing the email from the filter chain. These obviously only work as intended if the human() is slotted into a chain which ends at a delete(). You could obviously use the human() elsewhere, but you are likely to confuse yourself or your users. Again, MailQuery is a dangerous power-tool which will hurt you if you are not incredibly careful.