NearDeduplication Behavior
Near-deduplication behavior is described below for the following processing types:
Journal Mails
In mass-marketing scenarios, emails are sent to large numbers of users with external BCC participants. This creates unique journal copies of the same email, which results in near-duplicates.
NearDedupe: NO |
NearDedupe: YES |
The checksum in Enterprise Archive is recorded under a custom attribute, not in the GCID. GCID will become a random unique ID that Email Gateway constructs using:
|
The checksum in Enterprise Archive is recorded under a custom attribute, not in the GCID. GCID will become a random unique ID that Email Gateway constructs using: From the Original email (inside the envelope)
|
Result: Unique GCIDs are created for the near-duplicate emails, so Enterprise Archive searches return a separate email document for each near-duplicate. |
Result: The same GCID is created for the near-duplicate emails, so only one email document is discoverable in Enterprise Archive searches. |
Use-Case
Incoming Emails: Two unique journal copies (one with Bcc participants) arrive at Email Gateway (EGW), each wrapping the same email.
|
|
NearDedupe: NO |
NearDedupe: YES |
View in Enterprise Archive - eDiscovery Application: Two unique documents are shown because the near-duplicate emails are not deduplicated.
|
View in Enterprise Archive - eDiscovery Application: Only one document is discovered for the two near-duplicate emails ingested.
|
Participants: The original email wrapped inside the near-duplicate journals carries the same participant list as the one shown in Enterprise Archive:
|
Participants: The original email wrapped inside the near-duplicate journals carries the same participant list as the one shown in Enterprise Archive:
|
GCIDs: When NearDedupe is configured as NO, Email Gateway creates a unique GCID for each email, so they are discovered as two separate emails in Enterprise Archive. Sample Email 1
Sample Email 2
|
GCIDs: When NearDedupe is configured as YES, Email Gateway creates the same GCID for the two near-duplicate emails, so a single email is discovered in Enterprise Archive. Sample Email
|
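The effect of the two settings on the BCC use case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `JournalCopy` type, the `illustrative_gcid` function, and the use of SHA-256 are hypothetical, and the actual inputs Email Gateway uses to build a GCID are not listed in this article. The sketch assumes that with NearDedupe set to NO the envelope contributes to the identifier, and with YES only the wrapped email does.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class JournalCopy:
    """Hypothetical model of one journal copy arriving at Email Gateway."""
    envelope_recipients: tuple  # recipients listed in the journal envelope
    original_email: str         # raw text of the email wrapped inside


def illustrative_gcid(copy: JournalCopy, near_dedupe: bool) -> str:
    """Return an illustrative identifier (not the real GCID algorithm)."""
    parts = [copy.original_email]
    if not near_dedupe:
        # Assumption: without near-deduplication the envelope makes each copy unique.
        parts.append(",".join(copy.envelope_recipients))
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


def documents_discovered(copies, near_dedupe: bool) -> int:
    """Count distinct identifiers, i.e. documents a search would return."""
    return len({illustrative_gcid(c, near_dedupe) for c in copies})


email_text = "From: marketing@example.com\nSubject: Offer\n\nHello"
copy_1 = JournalCopy(("user1@example.com",), email_text)
copy_2 = JournalCopy(("user1@example.com", "bcc@external.example"), email_text)

print(documents_discovered([copy_1, copy_2], near_dedupe=False))
print(documents_discovered([copy_1, copy_2], near_dedupe=True))
```

Under these assumptions the two journal copies count as two documents with NearDedupe set to NO and as one document with NearDedupe set to YES, which mirrors the Result rows above.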
Direct Mails
Direct messages are not wrapped inside a journal envelope; they are sent directly to Email Gateway.
NearDedupe: NO |
NearDedupe: YES |
The checksum in Enterprise Archive is recorded under a custom attribute, not in the GCID. GCID will become a random unique ID that Email Gateway constructs using:
|
The checksum in Enterprise Archive is recorded under a custom attribute, not in the GCID. GCID will become a random unique ID that Email Gateway constructs using:
|
Result: Unique GCIDs are created for the near-duplicate emails, so Enterprise Archive searches return a separate email document for each near-duplicate. |
Result: The same GCID is created for the near-duplicate emails, so only one email document is discoverable in Enterprise Archive searches. |
Messages Extracted Directly from Users' Mailboxes
In some data migration use cases, emails can be directly extracted from users' mailboxes and archived in Enterprise Archive through Email Gateway.
NearDedupe: NO |
NearDedupe: YES |
The checksum in Enterprise Archive is recorded under a custom attribute, not in the GCID. GCID will become a random unique ID that Email Gateway constructs using:
|
The checksum in Enterprise Archive is recorded under a custom attribute, not in the GCID. GCID will become a random unique ID that Email Gateway constructs using:
|
Mailbox Owner example use-case
Alex sends an email with the subject "Some Subject" to another user, Diego. When their mailboxes are extracted, the Smarsh mailbox owner header "X-SMARSH-MAILBOX-OWNER" is added to the journal wrapper of each message extracted from each user's mailbox. The email with the subject "Some Subject" is therefore extracted twice: once from Alex's mailbox and once from Diego's mailbox.
The following shows how NearDedupe behaves with the NO and YES settings.
NearDedupe: NO |
NearDedupe: YES |
Incoming Emails: Two unique journal wrappers, each with the mailbox owner X-header and containing the email with the subject "Some Subject", arrive at EGW.
View in Enterprise Archive - eDiscovery Application: Two unique emails are discovered in Enterprise Archive for the emails extracted from Alex's and Diego's mailboxes:
|
Incoming Emails: Two unique journal wrappers, each with the mailbox owner X-header and containing the email with the subject "Some Subject", arrive at EGW.
View in Enterprise Archive - eDiscovery Application: A single email is discovered in Enterprise Archive for the emails extracted from Alex's and Diego's mailboxes:
|
Participants: Near-duplicate document journaled from Exchange On-Prem
Near-duplicate document journaled from O365, which added the On-Prem user as an additional participant
|
Participants: All participants from On-Prem and Cloud Exchange are captured in a single document
|
GCIDs: If NearDedupe=NO, Email Gateway creates a unique GCID for each email, so they are discovered as two separate emails in Enterprise Archive.
|
GCIDs: If NearDedupe=YES, Email Gateway creates the same GCID for the two near-duplicate emails, so a single email is discovered in Enterprise Archive.
|
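The Alex and Diego example can be modelled with a short Python sketch. It is illustrative only: the dictionary layout of a wrapper and the helper names `wrap_for_owner` and `inner_identity` are hypothetical, and the header fields used to compare the wrapped emails are an assumption, not a description of how Email Gateway works internally. Only the header name X-SMARSH-MAILBOX-OWNER comes from this article.

```python
from email import message_from_string

OWNER_HEADER = "X-SMARSH-MAILBOX-OWNER"

ORIGINAL_EMAIL = (
    "Message-ID: <abc123@example.com>\n"
    "From: alex@example.com\n"
    "To: diego@example.com\n"
    "Subject: Some Subject\n"
    "Date: Mon, 01 Jan 2024 10:00:00 +0000\n"
    "\n"
    "Hello Diego\n"
)


def wrap_for_owner(owner: str, original_email: str) -> dict:
    """Model a journal wrapper as wrapper headers plus the wrapped email text."""
    return {"headers": {OWNER_HEADER: owner}, "original": original_email}


def inner_identity(wrapper: dict) -> tuple:
    """Identify the wrapped email by a few of its own headers (assumed fields)."""
    msg = message_from_string(wrapper["original"])
    return (msg["Message-ID"], msg["From"], msg["Subject"], msg["Date"])


# The same email extracted once from each user's mailbox.
wrappers = [
    wrap_for_owner("alex@example.com", ORIGINAL_EMAIL),
    wrap_for_owner("diego@example.com", ORIGINAL_EMAIL),
]

owners = [w["headers"][OWNER_HEADER] for w in wrappers]
identities = {inner_identity(w) for w in wrappers}

print(owners)           # the wrappers differ in the mailbox owner header
print(len(identities))  # the wrapped email is the same in both
```

The two wrappers differ only in the mailbox owner header, while the wrapped email is identical. That difference is why the copies count as near-duplicates rather than exact duplicates.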
Microsoft Exchange Hybrid (BETA)
Deduplication in Microsoft Exchange Hybrid is available as BETA only on request.
A hybrid deployment offers organizations the ability to extend the feature-rich experience and administrative control they have with their existing on-premises Microsoft Exchange organization to the cloud. A hybrid deployment provides the seamless look and feel of a single Exchange organization between an on-premises Exchange organization and Exchange Online. In addition, a hybrid deployment can serve as an intermediate step to moving completely to an Exchange Online organization.
For more information, refer to https://docs.microsoft.com/en-us/exchange/exchange-hybrid
NearDedupe: NO |
NearDedupe: YES |
The checksum in Enterprise Archive is recorded under a custom attribute, not in the GCID. GCID will become a random unique ID that Email Gateway constructs using:
|
The checksum in Enterprise Archive is recorded under a custom attribute, not in the GCID. GCID will become a random unique ID that Email Gateway constructs using:
|
Scenario-1: Cloud to Cloud
Any situation where an email is forked leads to duplicate journaling, such as:
Transport chipping (near-duplicate journal messages are generated when a message has too many recipients; each journal message carries up to 1000 recipients).
Incoming Email: More than one unique journal message carries the same original message inside. Wrapper 1 has the information for participants 1-1000, wrapper 2 has the information for participants 1001 onward, and so on. Example: DL1 in the To field has more than a thousand participants, so more than one journal message is generated.
|
|
NearDedupe: NO |
NearDedupe: YES |
View in Enterprise Archive - eDiscovery Application: More than one near-duplicate document, each containing the same Subject, Body, etc., is discovered as a unique document. EA search results containing near-duplicates:
|
View in Enterprise Archive - eDiscovery Application: A single document containing the same Subject, Body, etc. is discovered after near-deduplication. EA search results containing one message ID:
|
GCIDs: If NearDedupe=NO, Email Gateway creates a unique GCID for each email, so they are discovered as separate emails in Enterprise Archive. Examples: Near-Dedupe message#1 with unique GCID
Near-Dedupe message#2 with unique GCID
|
GCIDs: If NearDedupe=YES, Email Gateway creates the same GCID for each email, so they are discovered as a single email in Enterprise Archive. Example: Message near-deduplicated in EA with the same GCID for all documents
|
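The transport chipping behavior described in Scenario-1 amounts to splitting a long recipient list into groups of at most 1000. The sketch below illustrates that split; the function name `chip_recipients` is hypothetical, and the code models only the grouping, not how Exchange actually builds journal messages.

```python
def chip_recipients(recipients, limit=1000):
    """Split a recipient list into consecutive groups of at most `limit` entries.

    Each group stands for the participants carried by one journal wrapper.
    """
    return [recipients[i:i + limit] for i in range(0, len(recipients), limit)]


# A distribution list with 2500 members produces three journal wrappers.
members = [f"user{n}@example.com" for n in range(1, 2501)]
wrappers = chip_recipients(members)

print(len(wrappers))
print([len(group) for group in wrappers])
print(wrappers[1][0])  # first participant in wrapper 2
```

Every wrapper carries the same original message, so with NearDedupe=NO each group becomes a separate document, and with NearDedupe=YES they collapse into one.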
Scenario-2: On-premises to Cloud
The message is journaled twice: once by the on-premises Exchange server and once by the cloud Exchange server.
Incoming Email: Email Gateway receives two journal messages, the first journaled by the on-premises Exchange server and the second journaled by Exchange in the O365 cloud. Both journal messages carry the same message inside; the first journal body lists the cloud user as the To recipient, and the second journal body lists the on-premises user as the recipient. Example: Same message received twice due to Microsoft journaling behavior.
|
|
NearDedupe: NO |
NearDedupe: YES |
View in Enterprise Archive - eDiscovery Application: More than one near-duplicate document, each containing the same Subject, Body, etc., is discovered as a unique document. Enterprise Archive search results containing near-duplicates:
|
View in Enterprise Archive - eDiscovery Application: A single document containing the same Subject, Body, etc. is discovered after near-deduplication.
|
Participants Info: Near-duplicate document journaled from Exchange On-Prem
Near-duplicate document journaled from O365, which added the on-premises user as an additional participant
|
Participants Info: All participants from on-premises and cloud Exchange are captured in a single document
|
GCIDs: If NearDedupe=NO, Email Gateway creates a unique GCID for each email, so they are discovered as two separate emails in Enterprise Archive. NearDedupe message#1 with unique GCID
NearDedupe message#2 with unique GCID
|
GCIDs: If NearDedupe=YES, Email Gateway creates the same GCID for each email, so they are discovered as a single email in Enterprise Archive. Message deduplicated in Enterprise Archive with the same GCID for all documents
|
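When the on-premises and cloud journal copies are near-deduplicated, the single resulting document carries the participants from both copies. A minimal sketch of that union follows. The function `merge_participants` is hypothetical and the case-insensitive address comparison is an assumption; the article does not describe how Enterprise Archive merges participants internally.

```python
def merge_participants(*journal_participant_lists):
    """Combine participants from several journal copies, keeping first-seen order.

    Addresses are compared case-insensitively (an assumption for this sketch).
    """
    merged = {}
    for participants in journal_participant_lists:
        for address in participants:
            merged.setdefault(address.lower(), address)
    return list(merged.values())


# Copy journaled by the on-premises Exchange server.
on_prem_copy = ["sender@example.com", "cloud.user@example.com"]
# Copy journaled by O365, which adds the on-premises user as a participant.
cloud_copy = ["sender@example.com", "cloud.user@example.com", "onprem.user@example.com"]

print(merge_participants(on_prem_copy, cloud_copy))
```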
Scenario-3: Cloud to On-premises
When a message is sent by an Exchange cloud user to a user in Exchange on-premises, Microsoft journals the message twice: once from the on-premises Exchange server and once from the cloud Exchange server.
Proofpoint URL Defense (BETA)
Deduplication in Proofpoint URL Defense is available as BETA only on request.
Proofpoint URL Defense, also known as Targeted Attack Protection (TAP) URL Defense, protects against URL-based email threats, including malware and credential phishing.
NearDedupe: NO |
NearDedupe: YES |
The checksum in Enterprise Archive is recorded under a custom attribute, not in the GCID. GCID will become a random unique ID that Email Gateway constructs using:
|
The checksum in Enterprise Archive is recorded under a custom attribute, not in the GCID. GCID will become a random unique ID that Email Gateway constructs using:
|
Splitting behavior
For each local recipient of the message, TAP generates a per-user message (duplicate) with all internal URL modifications, etc.
The per-user message contains the X-URLDefense header with the value duplicate.
The per-user message is sent to the corresponding local recipient.
The per-user message also contains:
X-URLDefense-Signature header with a value containing the signature of X-URLDefense and Message-ID.
X-Proofpoint-ORIG-GUID and X-Proofpoint-GUID headers with values containing a unique checksum for each duplicate message.
One canonical message is generated for the received message. The canonical message is the original message with modifications to its headers. The canonical message contains:
X-URLDefense header with the value canonical.
X-URLDefense-Bcc header with a value containing all additional recipients not specified in To: or Cc:.
X-URLDefense-Signature header with a value containing a signature of X-URLDefense, X-URLDefense-Bcc, and Message-ID.
The message is sent to the Proofpoint "nominated" inbox.
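The splitting behavior above means each message can be classified by its X-URLDefense header value. The sketch below shows one way to read that header with Python's standard `email` package. The helper names `url_defense_role` and `split_by_role` are hypothetical, and the sample messages are invented for illustration; only the header name and its two values (canonical, duplicate) come from the description above.

```python
from email import message_from_string


def url_defense_role(raw_message: str):
    """Return 'canonical', 'duplicate', or None based on the X-URLDefense header."""
    msg = message_from_string(raw_message)
    value = msg.get("X-URLDefense")
    if value is None:
        return None  # message did not pass through URL Defense splitting
    return value.strip().lower()


def split_by_role(raw_messages):
    """Separate canonical messages from per-user duplicates."""
    canonical, duplicates = [], []
    for raw in raw_messages:
        role = url_defense_role(raw)
        if role == "canonical":
            canonical.append(raw)
        elif role == "duplicate":
            duplicates.append(raw)
    return canonical, duplicates


canonical_msg = "Message-ID: <m1@example.com>\nX-URLDefense: canonical\n\nBody\n"
per_user_msg = "Message-ID: <m1@example.com>\nX-URLDefense: duplicate\n\nBody\n"
plain_msg = "Message-ID: <m2@example.com>\n\nBody\n"

canonical, duplicates = split_by_role([canonical_msg, per_user_msg, plain_msg])
print(len(canonical), len(duplicates))
```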
NearDedupe: NO |
NearDedupe: YES |
View in Enterprise Archive - eDiscovery Application
Previously, Email Gateway also included the body content when calculating the checksum for the GCID, which made each of these emails unique and caused duplicates in Enterprise Archive.
|
View in Enterprise Archive - eDiscovery Application: Email Gateway now considers only the Subject, From, MessageID, and Date fields to near-deduplicate the emails filtered through Proofpoint URL Defense, thereby displaying only one email in Enterprise Archive.
|
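The difference between the two columns can be illustrated with a checksum computed only over the Subject, From, Message-ID, and Date headers. This is a rough sketch, not Email Gateway's implementation: the function `near_dedupe_checksum`, the SHA-256 hash, and the way the fields are joined are all assumptions. Only the list of fields comes from this article.

```python
import hashlib
from email import message_from_string

# Fields the article names for near-deduplicating URL Defense emails.
NEAR_DEDUPE_FIELDS = ("Subject", "From", "Message-ID", "Date")


def near_dedupe_checksum(raw_message: str) -> str:
    """Hash only the selected headers, ignoring the body and other headers."""
    msg = message_from_string(raw_message)
    values = [(msg.get(name) or "").strip() for name in NEAR_DEDUPE_FIELDS]
    # Join with a separator unlikely to appear in header values (assumption).
    return hashlib.sha256("\x1f".join(values).encode("utf-8")).hexdigest()


HEADERS = (
    "Subject: Quarterly report\n"
    "From: sender@example.com\n"
    "Message-ID: <q1@example.com>\n"
    "Date: Mon, 01 Jan 2024 10:00:00 +0000\n"
)

# Per-user copies differ in the rewritten URL inside the body.
copy_a = HEADERS + "X-URLDefense: duplicate\n\nSee https://rewritten.example/a\n"
copy_b = HEADERS + "X-URLDefense: duplicate\n\nSee https://rewritten.example/b\n"

print(near_dedupe_checksum(copy_a) == near_dedupe_checksum(copy_b))
```

Because the body is excluded, the per-user copies with differently rewritten URLs produce the same value, whereas a checksum that included the body would differ for each copy.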