# Sanitising Email


For the past 15 years or so, I've been using a simple [Perl](https://www.perl.org/) script that I wrote called [gpgit](https://gitlab.com/grepular/gpgit) to encrypt email stored on my mail server, both [incoming](/Automatically_Encrypting_all_Incoming_Email) and [outgoing](/Automatically_Encrypting_all_Incoming_Email_Part_2). It just takes a raw email on stdin and writes the modified email to stdout. It has always been in the back of my mind that I could do a lot more than just encrypting an email, from a privacy and security perspective, but I never felt I had the time to do the project justice. That is, until LLMs came onto the scene.

I have just made available an open source project called [Sanimail](https://gitlab.com/grepular/sanimail) which I have been using on my own email for a little while now. It does what gpgit does, plus a *lot* more:

- [Inlining remote content](#inlining-remote-content)
- [Policy based sanitising of html/css/svg parts](#policy-based-sanitising-of-html-css-svg-parts)
- [Removal of tracking params in links](#removal-of-tracking-params-in-links)
- [Disarming of privacy invading headers](#disarming-of-privacy-invading-headers)
- [PGP and S/MIME encryption/decryption/signing](#pgp-and-s-mime-encryption-decryption-signing)
- [Other noteworthy options](#other-noteworthy-options)
- [Hardening](#hardening)
- [Deployment](#deployment)
- [Usage tips](#usage-tips)
- [Project status](#project-status)

## Inlining remote content

One of the long standing issues with email has been pixel tracking via remote images referenced in email HTML parts:

```html
<img src="https://example.com/pixel.png?emailId=U21hcnQgYXkCg">
```

You view an email, the image is fetched, the sender can now know that you read the email, when, and what your IP was at the time. Some of the larger email providers have started addressing this problem by replacing these links with links to their own proxies:

```html
<img src="https://proxy.example.net/?url=https%3A%2F%2Fexample.com%2Fpixel.png%3FemailId%3DU21hcnQgYXkCg">
```

So when you view the email, the sender only sees the proxy's IP, not yours. Some even claim to fetch remote content and cache it as soon as the email is delivered ([Apple lies about doing this](/Apples_Protect_Mail_Activity_Doesnt_Work)), so that the sender doesn't even know whether you actually read the message, let alone when or from where.

I host my own email and I wanted this functionality for myself, so I added it to Sanimail. Technically, my solution is better because the download is permanent. The proxy solutions created by the big mail providers will expire content from their cache, causing you to re-fetch it if you look at an older email.

```shell
$ sanimail --remote-inline < in.eml > out.eml
```

This searches the HTML, CSS and SVGs in the email for URLs that would be fetched, fetches them, attaches them to the email, and then updates each link to refer to the attachment instead of the remote URL. There are a whole bunch of limits, timeouts and image optimisations to make this work well, with corresponding command line options. You can even proxy through Tor if you want to confuse the sender some more, using `--remote-fetch-proxy socks5h://127.0.0.1:9050`:

```text
--remote-inline                            Fetch remote images and @font-face fonts and attach them inline (cid:); needs network

--remote-fetch-proxy string                Route remote fetches through a SOCKS5/HTTP proxy with remote DNS (works with Tor/.onion)
--remote-fetch-proxy-password string       Password for --remote-fetch-proxy authentication (visible in argv; prefer --remote-fetch-proxy-password-file)
--remote-fetch-proxy-password-file string  Read the --remote-fetch-proxy password from this file (trailing newline trimmed)
--remote-fetch-proxy-user string           Username for --remote-fetch-proxy authentication

--remote-img-deanimate                     Flatten a fetched animated GIF/APNG to its resting frame
--remote-img-deanimate-cap duration        Wall-time cap on de-animating one image; on expiry keep the frame composited so far (0 = unlimited) (default 500ms)
--remote-img-jpeg-quality int              Recompress fetched JPEGs at quality 1-100 (0 = off; re-encodes forced by other flags use 80)
--remote-img-max-height int                Downscale fetched raster images taller than N px (0 = no limit)
--remote-img-max-ram size                  Max approx peak RAM per decoded image (0 = unlimited; e.g. 384MiB) (default 402653184)
--remote-img-max-width int                 Downscale fetched raster images wider than N px (0 = no limit)
--remote-img-optimise                      Convert fetched static images to JPEG, or losslessly recompress PNG, when smaller

--remote-item-max-bytes size               Per-fetch byte cap (0 = unlimited; e.g. 8MiB) (default 8388608)
--remote-max-bytes size                    Total remote-fetch byte budget per message (0 = unlimited; e.g. 16MiB) (default 16777216)
--remote-max-count int                     Max distinct remote URLs fetched per message (0 = unlimited) (default 42)
--remote-max-parallel int                  Max concurrent remote fetches per message (0 = unlimited) (default 16)
--remote-max-parallel-per-host int         Max concurrent remote fetches to one host (0 = unlimited) (default 6)
--remote-timeout duration                  Per-fetch timeout for remote fetches (default 15s)
--remote-total-timeout duration            Aggregate remote-fetch budget per message (0 = unlimited) (default 45s)

--remote-neutralize-failures string        Neutralize failed-fetch image URLs: gone (404/410), permanent (+4xx/unusable, default), all (+403/transient) (default "permanent")
--remote-user-agent string                 User-Agent sent on remote fetches
```
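To make the rewrite step concrete, here is a minimal Python sketch of the general idea. It is not Sanimail's own code: it only handles `<img src="...">` in HTML, the `fetch` callable and the `inline.invalid` Content-ID domain are hypothetical, and it leaves failed fetches untouched rather than neutralising them.

```python
import re
from email.message import EmailMessage
from email.utils import make_msgid

# Only matches double-quoted http(s) src attributes on <img> tags (sketch-level parsing).
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")(https?://[^"]+)(")', re.IGNORECASE)


def inline_remote_images(html, fetch):
    """Return (rewritten_html, attachments); attachments maps Content-ID -> (bytes, subtype).

    `fetch` is a hypothetical callable: fetch(url) -> (data, image_subtype).
    """
    attachments = {}
    seen = {}  # url -> Content-ID, so each distinct URL is fetched once

    def replace(match):
        url = match.group(2)
        if url not in seen:
            try:
                data, subtype = fetch(url)
            except Exception:
                return match.group(0)  # this sketch leaves failed fetches as they were
            cid = make_msgid(domain="inline.invalid")  # looks like "<...@inline.invalid>"
            seen[url] = cid
            attachments[cid] = (data, subtype)
        # The HTML reference uses the Content-ID without its angle brackets.
        return f"{match.group(1)}cid:{seen[url][1:-1]}{match.group(3)}"

    return IMG_SRC.sub(replace, html), attachments


def build_message(html, attachments):
    """Wrap the rewritten HTML and its images in a multipart/related message."""
    msg = EmailMessage()
    msg.set_content(html, subtype="html")
    for cid, (data, subtype) in attachments.items():
        msg.add_related(data, maintype="image", subtype=subtype, cid=cid)
    return msg
```

Once the message has been rewritten like this, opening it triggers no network request, because every image reference points at a part of the message itself.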

This of course makes emails larger, as they now include attachments. What I do with my own email is route two incoming copies to different folders. One I route to an Archive folder, and the only change I make to it is encrypting it with [PGP](https://wikipedia.org/wiki/Pretty_Good_Privacy). The other goes to my Inbox, and has the remote content inlining and various other modifications applied. My thinking is that the email in my Inbox is the stuff I actually look at, and I can freely delete it rather than keeping it around for years, as I know I have the original "unmodified" version in my Archive if anything goes wrong or I need to refer to it in the future.

I recommend you use the `--remote-img-*` options if you're going to inline remote content. These can make a big difference to the byte size of the final email.

### Deanimation

The `--remote-img-deanimate` options can be particularly effective. A lot of marketing mail embeds multi-megabyte animated GIFs and PNGs nowadays. Not only are these distracting, but they can be shrunk down to a few tens of kilobytes or less if you just capture one frame. And from what I've seen, there really is no need to see these animations. If you can capture the "resting" frame, then that has everything you need. The algorithm I used for capturing the resting frame is to go to the end of the animation, and then work backwards until we reach a frame that isn't blank. I also short circuit this based on wall time: if we have spent more than a certain amount of time traversing frames, we just stop where we are and capture the current frame. The reason for this is that some animations have a lot of frames and can take a lot of processing power, and there is scope for somebody to create a malicious GIF with a lot of frames.
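The selection logic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Sanimail's implementation: frames are assumed to arrive already composited as 2D lists of pixel values, "blank" is assumed to mean "every pixel identical", and the `clock` parameter exists only so the time cap can be exercised deterministically.

```python
import time


def is_blank(frame):
    """Assumption for this sketch: a frame is blank if all of its pixels are equal."""
    first = frame[0][0]
    return all(pixel == first for row in frame for pixel in row)


def resting_frame(frames, cap_seconds=0.5, clock=time.monotonic):
    """Pick the resting frame from an iterable of composited frames.

    cap_seconds=0 means no time limit, mirroring the documented
    --remote-img-deanimate-cap semantics.
    """
    deadline = clock() + cap_seconds if cap_seconds > 0 else None
    decoded = []
    for frame in frames:
        decoded.append(frame)
        if deadline is not None and clock() >= deadline:
            return decoded[-1]  # time cap hit: keep the frame reached so far
    # Reached the end of the animation: walk backwards to the last non-blank frame.
    for frame in reversed(decoded):
        if not is_blank(frame):
            return frame
    return decoded[-1] if decoded else None
```

Keeping every frame in a list is only for readability. Tracking just the most recent non-blank frame while moving forward would give the same answer with far less memory.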

### Memory usage

People don't pre-optimise their images. I've seen examples of 30+ megapixel images for social media icons that are displayed using a 32x32 pixel image tag. These images can be small in byte size, because they compress well. However, our `--remote-img-max-width`, `--remote-img-max-height` and `--remote-img-optimise` options need to decompress these images in order to modify them, and sometimes that can take hundreds of megabytes of RAM. So we have a `--remote-img-max-ram` option to cap how much memory we are willing to use to optimise an image. If we estimate that it is going to exceed this limit, then we don't bother and just attach the original. There is still a total byte size limit option for fetches, so we can rely on that at least.
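The arithmetic behind such an estimate is simple enough to show. The sketch below assumes 4 bytes per pixel (RGBA) and two working copies of the bitmap during processing. Both figures are illustrative assumptions, not Sanimail's actual estimator. The dimensions can normally be read from the image header without decoding the pixel data.

```python
DEFAULT_MAX_RAM = 402653184  # 384 MiB, the documented --remote-img-max-ram default


def estimated_decode_bytes(width, height, bytes_per_pixel=4, working_copies=2):
    """Rough peak RAM for decoding and transforming one image (assumed model)."""
    return width * height * bytes_per_pixel * working_copies


def should_optimise(width, height, max_ram=DEFAULT_MAX_RAM):
    """Return False when the estimate exceeds the cap, meaning: attach the original."""
    if max_ram == 0:  # 0 = unlimited, as in the option listing
        return True
    return estimated_decode_bytes(width, height) <= max_ram
```

Under these assumptions a 6000x5000 (30 megapixel) icon is estimated at 240,000,000 bytes and would still be optimised, while a 12000x10000 image is estimated at 960,000,000 bytes and would be attached unmodified.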

## Policy based sanitising of html/css/svg parts

Should an email contain JavaScript? Should it contain iframes, video, forms, meta refresh tags, onclick handlers, file:// URIs, CSS @font-face? You get to define a policy yourself about what should be allowed in HTML, CSS and SVGs, be they remote, attached, inline, hidden inside data URIs, etc. There are some preset policies, as you probably don't want the hassle of doing this (despite it being quite easy). The one I recommend and use myself is "[standard](https://gitlab.com/grepular/sanimail/-/blob/main/email/policies/embedded/standard.policy)". It strips or "unwraps" (removing the tags and leaving the inner text) everything by default, except for a large list of benign allowed HTML tags, attributes, URI schemes, and CSS. I built this list up over time by including things I saw in live email, and anything else Claude and I could come up with that might need to be in there. It will be added to over time, especially as new HTML and CSS features are released, but I feel like it's already more than good enough. Anything it does strip, you probably won't notice or miss.

I built a separate library for this policy language and implementation called [htmlpolicy](https://gitlab.com/grepular/htmlpolicy). There is also a [minimal.policy](https://gitlab.com/grepular/sanimail/-/blob/main/email/policies/embedded/minimal.policy) embedded in Sanimail which does the opposite of the standard policy: it allows everything by default and then strips known bad markup like `script` tags and `on*` attributes. There are also other embedded policies where I've tried to replicate what [Protonmail](https://proton.me/mail), [Outlook](https://outlook.com), [Yahoo](https://mail.yahoo.com) and [Gmail](https://mail.google.com) do in this regard when displaying an email, using what public information I could find. See `sanimail policy presets` for a list and `sanimail policy export` to view the raw policies from the command line. You can easily export a preset, modify it, and then use it if you want.
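To illustrate the difference between stripping and unwrapping under a default-deny policy, here is a small Python sketch built on the standard library's `html.parser`. It does not use htmlpolicy or its policy language. The tiny `ALLOWED`, `STRIP` and `SAFE_SCHEMES` tables are made-up stand-ins for a real policy.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED = {"p": set(), "b": set(), "a": {"href"}, "img": {"src", "alt"}}  # tag -> attrs
STRIP = {"script", "style", "iframe"}  # removed together with everything inside them
SAFE_SCHEMES = ("https:", "http:", "mailto:", "cid:")
VOID = {"img"}  # allowed tags that never get a closing tag


class Sanitiser(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.strip_depth = 0  # > 0 while inside a stripped element

    def handle_starttag(self, tag, attrs):
        if tag in STRIP:
            self.strip_depth += 1
            return
        if self.strip_depth or tag not in ALLOWED:
            return  # unwrap: drop the tag itself, keep its inner text
        kept = []
        for name, value in attrs:
            value = value or ""
            if name not in ALLOWED[tag]:
                continue  # default deny: unknown attributes (onclick, ...) vanish
            if name in ("href", "src") and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue  # javascript:, file:, data: and friends are dropped
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in STRIP:
            self.strip_depth = max(0, self.strip_depth - 1)
        elif not self.strip_depth and tag in ALLOWED and tag not in VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.strip_depth:
            self.out.append(escape(data, quote=False))


def sanitise(html):
    parser = Sanitiser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.out)
```

With this toy policy, a `div` wrapper disappears but its contents survive, whereas a `script` element disappears along with its contents.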

What this amounts to is, `sanimail --policy standard < in.eml > out.eml` now makes it safer to use email clients like [Evolution Mail](https://help.gnome.org/evolution/), which have [privacy leaks](https://www.grepular.com/Evolution_Mail_Users_Easily_Trackable) that they have known about for years and done nothing about: [🤡](https://gitlab.gnome.org/GNOME/evolution/-/work_items/3095)

```text
# Fix known Evolution Mail privacy leaks
strip-tag link[rel=preconnect],link[rel=dns-prefetch]
```

You might wonder, what is the point in stripping script tags and other such items, because your email or webmail client will strip or ignore them anyway... Given the number of flaws that the [Email Privacy Tester](https://www.emailprivacytester.com/) has found over the years, I'd certainly not take that as a given.

## Removal of tracking params in links

`--detrack-urls` was an easy win as there is already a [dataset](https://gitlab.com/ClearURLs/rules) out there for identifying tracking parameters in URLs. It removes known tracking parameters from anchor tags in HTML parts and also from URLs in text/plain parts. It also handles known redirectors: `https://redirect.example.com/redir?u=https%3A%2F%2Freal.example.com%2F` may be replaced by `https://real.example.com/`, for example. A dataset is embedded in the Sanimail binary, but it may of course get out of date. If you want the latest and greatest, there are Sanimail commands to download a dataset, and a command line option to use that external dataset:

```shell
$ sanimail clearurls export -o ./clearurls.data # Cron this?
$ sanimail --detrack-urls --clearurls-data ./clearurls.data < in.eml > out.eml
```
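The two transformations, unwrapping redirectors and dropping tracking parameters, look roughly like this in Python. The rule tables here are hard-coded assumptions for illustration. The real ClearURLs dataset is regex based and far larger, and this sketch does not read its format.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Illustrative rules only, standing in for the ClearURLs dataset.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}
REDIRECTORS = {("redirect.example.com", "/redir"): "u"}  # (host, path) -> target param


def detrack(url, max_hops=5):
    """Unwrap known redirectors, then drop known tracking parameters."""
    for _ in range(max_hops):  # bounded, in case redirectors are nested
        parts = urlsplit(url)
        target_param = REDIRECTORS.get((parts.hostname, parts.path))
        if target_param is None:
            break
        target = dict(parse_qsl(parts.query)).get(target_param)
        if not target or not target.startswith(("http://", "https://")):
            break  # leave the link alone rather than emit something odd
        url = target
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    # Note: urlencode() re-encodes the surviving parameters, which can change
    # their byte-level encoding even though the decoded values are the same.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

For instance, `detrack("https://shop.example.com/item?id=42&utm_source=newsletter")` keeps `id=42` and drops the `utm_source` parameter.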

## Disarming of privacy invading headers

Adding/removing/disarming/overwriting headers is all supported of course:

```text
--headers-add stringArray       Add header "Name: value", keeping any existing of that name (repeatable)
--headers-set stringArray       Add header "Name: value", replacing any existing of that name (repeatable)
--headers-strip strings         Remove headers matching comma-separated glob patterns
--headers-disarm strings        Rename matching headers to Sanimail-Disarmed-{Name}

--headers-priority-disarm       Rename priority headers to Sanimail-Disarmed-{Name}
--headers-priority-strip        Remove priority headers
--headers-read-receipts-disarm  Rename read receipt headers to Sanimail-Disarmed-{Name}
--headers-read-receipts-strip   Remove read receipt headers
```

Why should the sender be able to dictate the priority of the message to you? Do you ever actually want to send read receipts? If not, strip/disarm the header so you can never do so accidentally.
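Disarming rather than stripping keeps the original value around for later inspection. A minimal Python sketch of the renaming, using the standard `email` package, might look like the following. The glob matching and the specific header names in the usage example are assumptions, not a statement of which headers Sanimail treats as priority or read receipt headers.

```python
from email import message_from_string
from fnmatch import fnmatchcase


def disarm_headers(msg, patterns, prefix="Sanimail-Disarmed-"):
    """Rename headers matching any glob pattern, preserving header order."""
    original = msg.items()  # list of (name, value) pairs in message order
    for name in {name for name, _ in original}:
        del msg[name]  # removes every occurrence, case-insensitively
    for name, value in original:
        if any(fnmatchcase(name.lower(), pattern.lower()) for pattern in patterns):
            name = prefix + name
        msg[name] = value  # re-append in the original order
    return msg


raw = (
    "From: sender@example.com\n"
    "X-Priority: 1\n"
    "Disposition-Notification-To: sender@example.com\n"
    "Subject: Quarterly numbers\n"
    "\n"
    "Please confirm you have read this.\n"
)
message = disarm_headers(
    message_from_string(raw), ["X-Priority", "Disposition-Notification-To"]
)
```

Because a mail client no longer sees a `Disposition-Notification-To` header, it has nothing to prompt you about, yet the renamed header still records who asked for the receipt.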

## PGP and S/MIME encryption/decryption/signing

I still encrypt the email that goes into my Archive folder with PGP, using the `--pgp-encrypt` option, but I also added [S/MIME](https://wikipedia.org/wiki/S/MIME) support to Sanimail, as I can't use PGP effectively on my iPhone, especially given that I store my PGP subkeys on external hardware ([Yubikey](https://www.yubico.com)). For live/temporary copies of email that go into my Inbox, I use `--smime-encrypt` instead.

There are also options for signing and decrypting with both PGP and S/MIME, if you need that. You could easily set up Sanimail to automatically decrypt an incoming email (be it PGP or S/MIME encrypted), apply a sanitisation policy, and then re-encrypt with either PGP or S/MIME if you wanted. Sanimail also supports [RFC 9788](https://www.rfc-editor.org/rfc/rfc9788.html) header protection. I don't believe there is much client support out there yet, but if you have a client that supports it, Sanimail will happily protect the headers that you want to be protected.

I decided to defer to [GnuPG](https://www.gnupg.org/) for all of the crypto for PGP and S/MIME. It is the only external dependency of the project, and you only need it if you wish to use the PGP or S/MIME options.

## Other noteworthy options

A few other options that are noteworthy:

- `--minify` - This will shrink your HTML, CSS and SVGs. Why store/transfer more bytes than necessary?
- `--strip-type` - Strip attachments with content-types that you don't trust.
- `--generate-plain` - Often, a text/plain part is missing, or is a short message pointing to a webpage, or is just malformed. We can generate our own from the HTML part, and it is usually better than the one the email came with. There are several modes dictating when it is suitable to generate one.
- `--keep-amp` - We strip [AMP](https://developers.google.com/workspace/gmail/ampemail) parts by default. They're a waste of space, we don't do any sanitisation of them, and there will be an HTML part anyway, so there is no need to keep the AMP one. If you are mad, or a Google employee, you might want to use this option to keep such parts.
- `--no-add-message-id` - By default, we add a Message-Id header when one is missing. Every message SHOULD have one according to [RFC 5322](https://datatracker.ietf.org/doc/html/rfc5322). Use this option if you disagree.
- `--strip-dark-mode` - I use dark mode on my phone and laptop. Some transactional and marketing email arrives with dark mode styles that have clearly never been looked at by a human at the sending organisation. So I just strip them, to force light mode. You can do this entirely from policy, but I didn't want to include it in any presets as it's opinionated, so this is just a helper so you don't have to write your own policy to get it. This is the corresponding policy:

    ```text
    css-strip-media prefers-color-scheme:dark
    css-strip       color-scheme
    strip-tag       meta[name=color-scheme]
    ```


## Hardening

Sanimail processes all kinds of untrusted and potentially malicious input in the form of [MIME](https://wikipedia.org/wiki/MIME) structures, HTML, CSS, SVGs and images of various formats. Because of this, I decided it was important to try and reduce the blast radius of any potential compromise. To that end, Sanimail uses [Landlock](https://docs.kernel.org/security/landlock.html) and [Seccomp](https://wikipedia.org/wiki/Seccomp) where available, to limit system calls, filesystem access and network access to what is needed. We do this as early as possible during runtime, i.e. before we start reading any email. If there is a chance that we will need to launch gpg or gpgsm, Sanimail will launch a second minimal copy of itself at startup with higher privileges, which only handles launching gpg/gpgsm and passing data back and forth to the main process. Substantial tests have been written, including fuzzing, to try and avoid compromise in the first place of course.

## Deployment

Statically compiled binaries for various architectures can be downloaded from the GitLab [release page](https://gitlab.com/grepular/sanimail/-/releases). You can also just clone the repo and run `make build`. As long as you have [Go](https://go.dev/) installed, this will work.

There are numerous ways to run Sanimail as it simply reads an email from stdin and writes to stdout. I personally use it on my [Dovecot](https://dovecot.org/) server via a [Sieve](https://wikipedia.org/wiki/Sieve_(mail_filtering_language)) filter:

```sieve
filter "sanimail" [
    "--detrack-urls",
    "--policy", "standard",
    "--policy-audit-comment",
    "--minify",
    "--generate-plain", "always",
    "--headers-read-receipts-disarm",
    "--headers-priority-disarm",
    "--remote-inline",
    "--remote-img-optimise",
    "--remote-img-deanimate",
    "--remote-img-max-width",  "800",
    "--remote-img-max-height", "600",
    "--strip-dark-mode",
    "--pgp-skip-encrypted",
    "--smime-encrypt", "mike.cardwell@example.com",
    "--gpgsm-path",    "/usr/bin/gpgsm"
];
```

You can also use an [Exim](https://www.exim.org) transport or [Procmail](https://wikipedia.org/wiki/Procmail).
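Because the contract is just "raw email in on stdin, modified email out on stdout", any delivery pipeline that can run a subprocess can use it. As a hedged illustration, here is a small Python wrapper that pipes a message through an arbitrary filter command and falls back to the original message on failure. The function name, the timeout value and the fail-open behaviour are choices made for this sketch, not something Sanimail prescribes.

```python
import subprocess


def run_filter(raw_email, command, timeout=60):
    """Pipe raw_email (bytes) through command; return the original on any failure."""
    try:
        result = subprocess.run(
            command,
            input=raw_email,
            capture_output=True,
            timeout=timeout,
            check=True,  # a non-zero exit status raises CalledProcessError
        )
    except (OSError, subprocess.SubprocessError):
        # Fail open: deliver the unmodified message rather than lose it.
        return raw_email
    return result.stdout or raw_email  # treat empty output as a failure too
```

It could be called as `run_filter(raw, ["sanimail", "--policy", "standard", "--detrack-urls"])`, using only flags described above. Failing open is a trade-off: if the filter is also responsible for encrypting mail at rest, a failure would store the message unencrypted, so deferring delivery instead may be the better choice in that setup.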

## Usage tips

If you're using `--remote-inline`, the amount of time it takes to run is heavily dictated by how many remote items need to be fetched and how quickly that can happen. Because of this, I had to extend the default amount of time that a Sieve filter can run for in Dovecot.

I had to include `--gpgsm-path` for my Dovecot sieve filter, because Dovecot wipes out the `PATH` env variable.

I noticed something odd on my iPhone when I started delivering a PGP encrypted version of an email to the Archive folder, and a second S/MIME encrypted version of it to Inbox. Both had the same Message-Id header, so the iPhone used this to decide that they were the same email. So when I tried to open either copy of the email, I didn't know which one I was going to get. To address this issue I added `--headers-strip Message-Id` to the filter which created the Archive version. Sanimail strips the Message-Id on that one, but then automatically generates a new unique one to add, because I don't use `--no-add-message-id`.
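The effect of that strip-then-regenerate step can be reproduced with Python's `email` package. This is only a sketch of the idea: the `example.com` domain is a placeholder, and the format of the IDs that `make_msgid` produces is not claimed to match what Sanimail generates.

```python
from email import message_from_string
from email.utils import make_msgid


def replace_message_id(msg, domain="example.com"):
    """Drop any existing Message-Id and add a freshly generated unique one."""
    del msg["Message-Id"]  # header names are matched case-insensitively
    msg["Message-Id"] = make_msgid(domain=domain)
    return msg


raw = "Message-ID: <original@sender.example>\nSubject: hello\n\nbody\n"
inbox_copy = message_from_string(raw)  # keeps the sender's Message-Id
archive_copy = replace_message_id(message_from_string(raw))  # gets a new one
```

The two copies now carry different Message-Id values, so a client that deduplicates on that header treats them as separate messages.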

## Project status

We are not at v1 yet, because I am still working on this project quite a lot and want the ability to make backwards incompatible changes. I feel like the risk is fairly low that I will need to now, but I don't wish to limit myself. If there is a backwards incompatible change, you'll see it mentioned in the release notes. I have released this project (and htmlpolicy) licensed under the [AGPL](https://www.fsf.org/bulletin/2021/fall/the-fundamentals-of-the-agplv3), but with the possibility of obtaining a commercial license too. I am not interested in hearing why I should use MIT or some other license instead. One consequence of this dual licensing is that if you wish to contribute, you'll need to sign over the rights for me to relicense your contributions. Feature requests, suggestions and bug reports without corresponding patches are welcome of course.