About
History
Pyzor started out as merely a Python implementation of Razor. However, because of Razor's protocol and because Razor's server is not Open Source or software libre, Frank Tobin decided to implement Pyzor with a new protocol and to release the entire system as Open Source and software libre.
Protocol
The central premise of Pyzor is that it converts an email message to a short digest that uniquely identifies the message. Simply hashing the entire message is an ineffective way to generate a digest, for two reasons: message headers differ between deliveries even when the content does not, and spammers often try to make each message unique by injecting random or unrelated text.
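The header problem can be illustrated with a minimal sketch. The messages, addresses, and helper names below (`naive_digest`, `body_digest`) are hypothetical and are not part of Pyzor. SHA-1 is used only as an example hash function.

```python
import hashlib

# The same spam body delivered to two recipients (hypothetical example data).
body = "Buy cheap watches now!\nLimited offer, reply today.\n"
msg_a = "To: alice@example.com\nMessage-ID: <1@example.com>\n\n" + body
msg_b = "To: bob@example.com\nMessage-ID: <2@example.com>\n\n" + body


def naive_digest(raw_message):
    """Hash the whole message, headers included."""
    return hashlib.sha1(raw_message.encode("utf-8")).hexdigest()


def body_digest(raw_message):
    """Hash only the text after the first blank line (the body)."""
    _headers, _sep, message_body = raw_message.partition("\n\n")
    return hashlib.sha1(message_body.encode("utf-8")).hexdigest()


# Headers differ, so hashing everything yields two unrelated digests.
print(naive_digest(msg_a) == naive_digest(msg_b))  # False
# Dropping the headers makes the two deliveries match.
print(body_digest(msg_a) == body_digest(msg_b))    # True
# Injected random text still defeats a plain body hash.
print(body_digest(msg_a) == body_digest(msg_a + "xq7 zv9\n"))  # False
```

Dropping the headers fixes the first problem but not the second, which is why the digest is built from a normalized sample of the body rather than from the body as a whole.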
To generate a digest, the 2.0 version of the Pyzor protocol:
- Discards all message headers.
- If the message is greater than 4 lines in length:
  - Discards the first 20% of the message.
  - Uses the next 3 lines.
  - Discards the next 40% of the message.
  - Uses the next 3 lines.
  - Discards the remainder of the message.
- Removes any ‘words’ (sequences of characters separated by whitespace) that are 10 or more characters long.
- Removes anything that looks like an email address (X@Y).
- Removes anything that looks like a URL.
- Removes anything that looks like HTML tags.
- Removes any whitespace.
- Discards any lines that are fewer than 8 characters in length.
This list is intended as an easy-to-understand explanation rather than a precise technical specification.