[MlMt] Improving search performance?

Bill Cole mmlist-20120120 at billmail.scconsult.com
Mon Jun 26 14:51:22 EDT 2023


On 2023-06-25 at 10:59:53 UTC-0400 (Sun, 25 Jun 2023 16:59:53 +0200)
Robert M. Münch <mailmate at lists.freron.com>
is rumored to have said:
[...]
>> One serious issue with indexing email is that email is highly 
>> divergent in data structure, and while you can do a simple index for 
>> basic standard mail metadata, "full text" and "all headers" search 
>> for mail is a nightmare because real-world mail breaks almost every 
>> rule theoretically governing it and it is not a simple matter to 
>> determine what is or is not body text. Email typically arrives with 
>> multiple alternative parts theoretically representing the same 
>> message, possibly QP or B64 encoded and usually including one version 
>> with HTML markup. And that markup can be bad, wrong, or even 
>> intentionally malicious.
>
> Well, MM already handles all this, otherwise we couldn't use it as we 
> do. Those parts are well known to MM.

I've had a bug open for quite a while regarding an MM parsing problem 
with plain-text messages generated by automated tools.

I don't know what the root cause of that is, but I am certain that Benny 
does not have all the arcana handled.
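
To make the "what is or is not body text" problem concrete, here is a 
minimal sketch using Python's standard `email` package. It is only an 
illustration of the general technique, not how MM does it: the function 
name `indexable_body` and the plain-before-HTML preference are my own 
assumptions, and it ignores attachments, nested messages, and broken 
MIME structure entirely.

```python
from email import message_from_bytes, policy


def indexable_body(raw: bytes) -> str:
    """Return one text rendering of a raw message for indexing.

    Hypothetical helper: prefers text/plain over text/html when a
    message offers alternatives, and lets the email package undo
    quoted-printable or base64 transfer encoding.
    """
    msg = message_from_bytes(raw, policy=policy.default)

    # Pick one body part out of the alternatives, plain text first.
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return ""

    try:
        # Decodes the transfer encoding and the declared charset.
        return part.get_content()
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset label: fall back to lossy UTF-8.
        payload = part.get_payload(decode=True) or b""
        return payload.decode("utf-8", errors="replace")
```

Even this much only covers mail that is well formed. If the HTML part 
is the one selected, the markup still has to be stripped before 
indexing, and that is where bad or malicious HTML becomes a problem.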

>> Very large mail stores are inherently tough to search.
>
> After pre-processing all the mail mess, I don't think so. Searching in 
> Gmail is fast. MM is already much better than other clients.

I haven't looked in a long while, but last I checked, Gmail could not 
search on arbitrary headers. Have they fixed that?

That's a huge part of the scaling problem. There are not a lot of people 
who really use that feature, but we do value it highly. The extremely 
long tail of headers and full-text tokens that only appear in a small 
number of messages makes mail particularly hard to search efficiently if 
you include all the garbage spam full of 'hashbusters' and such.
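
A toy illustration of that long tail, assuming a naive inverted index 
that maps every token to the set of messages containing it. The names 
`build_index` and `singleton_fraction` are hypothetical, and the 
tokenizer is deliberately crude; a real index would be far more 
involved.

```python
import re
from collections import defaultdict


def build_index(messages):
    """Map each lowercase alphanumeric token to the ids of messages using it."""
    index = defaultdict(set)
    for msg_id, text in messages.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(msg_id)
    return index


def singleton_fraction(index):
    """Fraction of distinct tokens that occur in exactly one message."""
    if not index:
        return 0.0
    singles = sum(1 for ids in index.values() if len(ids) == 1)
    return singles / len(index)


# Message 2 carries random 'hashbuster' strings, as spam often does.
messages = {
    1: "meeting at noon",
    2: "meeting moved xq7zk9 pl0vw3",
    3: "meeting noon",
}
index = build_index(messages)
fraction = singleton_fraction(index)
```

Every random string becomes its own index entry pointing at a single 
message, so the index grows with the amount of junk rather than with 
the vocabulary anyone actually searches for.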

> IMO the use-case search *1+ million emails as fast as possible* is 
> just not in scope for most of the clients.

Right, because most users do not need that. I don't know that any mail 
client does it as well as MM with the same search capabilities.

-- 
Bill Cole
bill at scconsult.com or billcole at apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire

