NearDeduplication Behavior

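NearDeduplication behavior is described for the following processing types: journal mails, direct mails, messages extracted directly from users' mailboxes, Microsoft Exchange Hybrid (BETA), and Proofpoint URL Defense (BETA).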

Limitation

For emails journaled from the Domino server, deduplication does not apply to participants (email IDs) listed in the Bcc field. This limitation applies to both internal and external participants in the Bcc field.

Journal Mails

In mass-marketing scenarios, emails are sent in bulk to users with external BCC participants. This creates unique journal copies of the same email, which results in near-dupes.

NearDedupe: NO

NearDedupe: YES

With NearDedupe set to NO, the checksum in Enterprise Archive is recorded under a custom attribute, not in the GCID. The GCID becomes a random unique ID that Email Gateway constructs using:

  1. From the Journal Report (envelope)

    • FROM

    • TO

    • CC

    • BCC

  2. From the Original email (inside the envelope)

    • Subject

    • Plain Content

    • HTML Content

    • Attachments

    • messageID

    • Email Date

With NearDedupe set to YES, the checksum in Enterprise Archive is recorded under a custom attribute, not in the GCID. The GCID becomes a random unique ID that Email Gateway constructs using:

From the Original email (inside the envelope)

  • Subject

  • From

  • messageID

  • Email Date
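The effect of the two field lists above can be illustrated with a minimal Python sketch. It assumes a SHA-256 digest over the selected fields. The actual checksum algorithm used by Email Gateway is not stated here, and the dictionary keys and the `field_checksum` helper are hypothetical names chosen for illustration.

```python
import hashlib

# Field sets mirroring the two lists above (names are illustrative only).
FIELDS_NEARDEDUPE_NO = [
    "envelope_from", "envelope_to", "envelope_cc", "envelope_bcc",
    "subject", "plain_content", "html_content", "attachments",
    "message_id", "email_date",
]
FIELDS_NEARDEDUPE_YES = ["subject", "from", "message_id", "email_date"]


def field_checksum(journal_copy, fields):
    """Hash the selected fields in a fixed order (SHA-256 is an assumption)."""
    digest = hashlib.sha256()
    for name in fields:
        value = str(journal_copy.get(name, ""))
        # NUL separators keep adjacent fields from running together.
        digest.update(name.encode("utf-8") + b"\x00")
        digest.update(value.encode("utf-8") + b"\x00")
    return digest.hexdigest()


# Two journal copies of the same email; only the envelope BCC differs.
original = {
    "from": "marketing@example.com",
    "subject": "Spring offer",
    "plain_content": "Hello",
    "html_content": "<p>Hello</p>",
    "attachments": "",
    "message_id": "<offer-1@example.com>",
    "email_date": "Mon, 01 Jan 2024 10:00:00 +0000",
}
copy_a = dict(original, envelope_from="marketing@example.com",
              envelope_to="list@example.com", envelope_cc="", envelope_bcc="")
copy_b = dict(copy_a, envelope_bcc="partner@example.org")

print(field_checksum(copy_a, FIELDS_NEARDEDUPE_NO) ==
      field_checksum(copy_b, FIELDS_NEARDEDUPE_NO))    # False
print(field_checksum(copy_a, FIELDS_NEARDEDUPE_YES) ==
      field_checksum(copy_b, FIELDS_NEARDEDUPE_YES))   # True
```

Because the shorter list leaves out the envelope participants, a copy that differs only in its BCC participants maps to the same value in this sketch.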

Result

Unique GCIDs are created for the near-dupe emails, so Enterprise Archive searches return a separate email document for each near-dupe email.

Result

The same GCID is created for the near-dupe emails, so only one email document is discoverable from Enterprise Archive searches.

Use-Case

Incoming Emails

Two unique journal copies (one with Bcc participants) arrive at Email Gateway (EGW), each wrapping the same email.

images/download/attachments/138521568/JournalMail_NearDedupe_Off.png

NearDedupe: NO

NearDedupe: YES

View in Enterprise Archive - eDiscovery Application

Two unique documents are discovered because the near-dupe emails are not deduplicated.

images/download/attachments/138521568/JournalMail_NearDedupe_Off_EA.png

View in Enterprise Archive - eDiscovery Application

Only one document is discovered for the two near-dupe emails ingested.

images/download/attachments/138521568/JournalMail_NearDedupe_On_EA.png

Participants

The original email wrapped inside the NearDedupe journals carries the same participant list as Enterprise Archive:

images/download/attachments/138521568/Participant_Off.png

Participants

The original email wrapped inside the NearDedupe journals carries the same participant list as Enterprise Archive:

images/download/attachments/138521568/Participant_Off.png

GCIDs

If NearDedupe is configured as NO, Email Gateway creates a unique GCID for each email, so they are discovered as two separate emails in Enterprise Archive.

Sample Email 1

images/download/attachments/138521568/GDIC_Eg1.png

Sample Email 2

images/download/attachments/138521568/GDIC_Eg2.png

GCIDs

When NearDedupe is configured as YES, Email Gateway creates the same GCID for the two near-dupe emails, so a single email is discovered in Enterprise Archive.

Sample Email

images/download/attachments/138521568/GDIC_Eg1_ND.png

Direct Mails

Direct messages are not wrapped inside a journal envelope. Instead, they are sent directly to Email Gateway.

NearDedupe: NO

NearDedupe: YES

With NearDedupe set to NO, the checksum in Enterprise Archive is recorded under a custom attribute, not in the GCID. The GCID becomes a random unique ID that Email Gateway constructs using:

  1. From the Journal Report (envelope)

    • FROM

    • TO

    • CC

    • BCC

  2. From the Original email (inside the envelope)

    • Subject

    • Plain Content

    • HTML Content

    • Attachments

    • messageID

    • Email Date

With NearDedupe set to YES, the checksum in Enterprise Archive is recorded under a custom attribute, not in the GCID. The GCID becomes a random unique ID that Email Gateway constructs using:

  • Subject

  • From

  • messageID from the email inside the journal envelope (P2 email)

  • Email Date from the email inside the journal envelope (P2 email)
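As a rough illustration of the four-field list above, the following Python sketch parses a raw message with the standard `email` module and derives a key from Subject, From, Message-ID, and Date only. The `near_dedupe_key` helper and the SHA-256 digest are assumptions made for this example and do not describe Email Gateway internals.

```python
import hashlib
from email import message_from_string

KEY_HEADERS = ("Subject", "From", "Message-ID", "Date")


def near_dedupe_key(raw_message):
    """Digest of the four headers listed above (SHA-256 is an assumption)."""
    msg = message_from_string(raw_message)
    values = [msg.get(name, "") for name in KEY_HEADERS]
    return hashlib.sha256("\n".join(values).encode("utf-8")).hexdigest()


HEADERS = (
    "From: alex@example.com\n"
    "Subject: Quarterly report\n"
    "Message-ID: <report-1@example.com>\n"
    "Date: Mon, 01 Jan 2024 10:00:00 +0000\n"
)
# Same email delivered twice with different recipients and body text.
copy_1 = HEADERS + "To: diego@example.com\n\nPlease review.\n"
copy_2 = HEADERS + "To: sam@example.com\n\nPlease review. (forwarded)\n"
# A genuinely different email: the Message-ID changes.
other = copy_1.replace("<report-1@example.com>", "<report-2@example.com>")

print(near_dedupe_key(copy_1) == near_dedupe_key(copy_2))  # True
print(near_dedupe_key(copy_1) == near_dedupe_key(other))   # False
```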

Result

Unique GCIDs are created for the near-dupe emails, so Enterprise Archive searches return a separate email document for each near-dupe email.

Result

The same GCID is created for the near-dupe emails, so only one email document is discoverable from Enterprise Archive searches.

Messages Extracted Directly from Users' Mailboxes

In some data migration use cases, emails can be directly extracted from users' mailboxes and archived in Enterprise Archive through Email Gateway.

Note

X-SMARSH-MAILBOX-OWNER is a custom header that Smarsh adds while extracting data from mailboxes and sends to Email Gateway. The header value is the user’s primary email address.


NearDedupe: NO

NearDedupe: YES

With NearDedupe set to NO, the checksum in Enterprise Archive is recorded under a custom attribute, not in the GCID. The GCID becomes a random unique ID that Email Gateway constructs using:

  • FROM

  • TO

  • CC

  • Subject

  • Plain Content

  • HTML Content

  • Attachments

  • MailboxOwner X-Header

With NearDedupe set to YES, the checksum in Enterprise Archive is recorded under a custom attribute, not in the GCID. The GCID becomes a random unique ID that Email Gateway constructs using:

  • Subject

  • From

  • messageID from the email inside the journal envelope (P2 email)

  • Email Date from the email inside the journal envelope (P2 email)

Mailbox Owner example use-case

Alex, a user, sends an email with the subject "Some Subject" to another user, Diego. When their mailboxes are extracted, the Smarsh mailbox owner header "X-SMARSH-MAILBOX-OWNER" is added to the journal wrapper of each message extracted from each user’s mailbox. The email with the subject "Some Subject" is extracted twice: once from Alex’s mailbox and once from Diego’s mailbox.

The following shows how NearDedupe works with the NO and YES settings.

NearDedupe: NO

NearDedupe: YES

Incoming Emails

Two unique journal wrappers, each with the mailbox owner X-Header and containing the email with the subject 'Some Subject', arrive at EGW:

  • Journal wrapper of Alex’s email will have X-SMARSH-MAILBOX-OWNER= alexw@****.com

  • Journal wrapper of Diego’s email will have X-SMARSH-MAILBOX-OWNER= diegos@****.com

View in Enterprise Archive - eDiscovery Application

Two unique emails are discovered in Enterprise Archive for the emails extracted from Alex’s and Diego’s mailboxes:

images/download/attachments/138521568/UsersMailbox_Ediscovery_ND_Off.png

Incoming Emails

Two unique journal wrappers, each with the mailbox owner X-Header and containing the email with the subject 'Some Subject', arrive at EGW:

  • Journal wrapper of Alex’s email will have X-SMARSH-MAILBOX-OWNER= alexw@****.com

  • Journal wrapper of Diego’s email will have X-SMARSH-MAILBOX-OWNER= diegos@****.com

View in Enterprise Archive - eDiscovery Application

A single email is discovered in Enterprise Archive for the emails extracted from Alex’s and Diego’s mailboxes:

images/download/attachments/138521568/UsersMailbox_Ediscovery_ND_On.png
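The mailbox owner use case can be sketched in Python as follows. The example groups the two extracted copies into documents under each setting, using the field lists given for this processing type. The SHA-256 digest, the dictionary keys, the helper names, and the example.com addresses (standing in for the masked addresses above) are assumptions for illustration only.

```python
import hashlib
from collections import defaultdict


def digest(values):
    # Join with a unit-separator character so values cannot run together.
    return hashlib.sha256("\x1f".join(values).encode("utf-8")).hexdigest()


def key_neardedupe_no(m):
    # Includes recipients, content, and the mailbox owner header value.
    return digest([m["from"], m["to"], m["cc"], m["subject"], m["plain"],
                   m["html"], m["attachments"], m["mailbox_owner"]])


def key_neardedupe_yes(m):
    # Only Subject, From, messageID, and Email Date.
    return digest([m["subject"], m["from"], m["message_id"], m["date"]])


def group_documents(messages, key_func):
    """Map each key to the mailbox owners whose copies share that key."""
    documents = defaultdict(list)
    for m in messages:
        documents[key_func(m)].append(m["mailbox_owner"])
    return documents


email = {
    "from": "alexw@example.com", "to": "diegos@example.com", "cc": "",
    "subject": "Some Subject", "plain": "Hi Diego", "html": "",
    "attachments": "", "message_id": "<msg-1@example.com>",
    "date": "Mon, 01 Jan 2024 10:00:00 +0000",
}
extracted = [
    dict(email, mailbox_owner="alexw@example.com"),   # from Alex's mailbox
    dict(email, mailbox_owner="diegos@example.com"),  # from Diego's mailbox
]

print(len(group_documents(extracted, key_neardedupe_no)))   # 2
print(len(group_documents(extracted, key_neardedupe_yes)))  # 1
```

In this sketch the mailbox owner value is the only field that differs between the two copies, so including it yields two documents and omitting it yields one.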

Participants

Near-duplicate document Journaled from Exchange On-Prem

images/download/attachments/138521568/Participants_1.png

Near-duplicate document journaled from O365, which added the On-Prem user as an additional participant

images/download/attachments/138521568/Prticipants_3.png

Participants

All participants from On-Prem and Cloud Exchange are captured in a single document

images/download/attachments/138521568/Participants_2.png

GCIDs

If NearDedupe=NO, Email Gateway creates a unique GCID for each email, so they are discovered as two separate emails in Enterprise Archive.

images/download/attachments/138521568/EA_ND_Off_1.png

images/download/attachments/138521568/EA_ND_Off_2.png

GCIDs

If NearDedupe=YES, Email Gateway creates the same GCID for the two near-dupe emails, so a single email is discovered in Enterprise Archive.

images/download/attachments/138521568/EA_ND_On.png

Microsoft Exchange Hybrid (BETA)

Deduplication in Microsoft Exchange Hybrid is available as BETA only on request.

A hybrid deployment offers organizations the ability to extend the feature-rich experience and administrative control they have with their existing on-premises Microsoft Exchange organization to the cloud. A hybrid deployment provides the seamless look and feel of a single Exchange organization between an on-premises Exchange organization and Exchange Online. In addition, a hybrid deployment can serve as an intermediate step to moving completely to an Exchange Online organization.

For more information, refer to https://docs.microsoft.com/en-us/exchange/exchange-hybrid

NearDedupe: NO

NearDedupe: YES

With NearDedupe set to NO, the checksum in Enterprise Archive is recorded under a custom attribute, not in the GCID. The GCID becomes a random unique ID that Email Gateway constructs using:

  • FROM

  • TO

  • CC

  • BCC

  • Subject

  • Plain Content

  • HTML Content

  • Attachments

  • messageID from the email inside the journal envelope (P2 email)

  • Email Date from the email inside the journal envelope (P2 email)

With NearDedupe set to YES, the checksum in Enterprise Archive is recorded under a custom attribute, not in the GCID. The GCID becomes a random unique ID that Email Gateway constructs using:

  • FROM

  • Subject

  • messageID from the email inside the journal envelope (P2 email)

  • Email Date from the email inside the journal envelope (P2 email)

Scenario-1: Cloud to Cloud

Any situation where an email is forked leads to duplicate journaling, such as:

Transport chipping (near-duplicate journal messages are generated when there are too many recipients on the message; each journal message carries up to 1000 recipients).

Incoming Email

More than one unique journal message arrives, each carrying the same original message inside. Wrapper 1 has the information for participants 1-1000, wrapper 2 has the information for participants 1001 onward, and so on.

DL1 in TO has more than a thousand participants, so more than one journal message is generated.
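The splitting described above can be pictured with a short Python sketch that divides an expanded recipient list into groups of at most 1000, one group per journal wrapper. The `split_recipients` helper and the generated addresses are hypothetical; the sketch only models the 1000-recipient limit stated above, not how Exchange actually performs the split.

```python
def split_recipients(recipients, limit=1000):
    """Split a recipient list into consecutive groups of at most `limit`."""
    return [recipients[i:i + limit] for i in range(0, len(recipients), limit)]


# A distribution list that expands to 2500 participants.
dl1 = [f"user{n}@example.com" for n in range(1, 2501)]
wrappers = split_recipients(dl1)

print(len(wrappers))               # 3
print([len(w) for w in wrappers])  # [1000, 1000, 500]
print(wrappers[1][0])              # user1001@example.com
```

Each group corresponds to one journal message wrapping the same original email, which is why the copies are near-duplicates rather than exact duplicates.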

images/download/attachments/138521568/MSExhHyb_IncomingMail.png

NearDedupe: NO

NearDedupe: YES

View in Enterprise Archive - eDiscovery Application

More than one near-duplicate document containing the same Subject, Body, etc. is discovered, each as a unique document.

EA Search results containing Near-duplicates:

images/download/attachments/138521568/MSExhHyb_EA1.png

View in Enterprise Archive - eDiscovery Application

A single document containing the same Subject, Body, etc. is discovered after near-deduplication.

EA search results containing one message ID:

images/download/attachments/138521568/MSExhHyb_EA2.png

GCIDs

If NearDedupe=NO, Email Gateway creates a unique GCID for each email, so they are discovered as separate emails in Enterprise Archive.

Examples

Near-Dedupe message#1 with unique GCID

images/download/attachments/138521568/ExchHyb_GCID_1.png

Near-Dedupe message#2 with unique GCID

images/download/attachments/138521568/ExchHyb_GCID_2.png

GCIDs

If NearDedupe=YES, Email Gateway creates the same GCID for each email, so they are discovered as a single email in Enterprise Archive.

Examples

Message Near-Deduplicated in EA with same GCID for all documents

images/download/attachments/138521568/ExchHyb_GCID_3.png

Scenario-2: On-premises to Cloud

The message is journaled twice: once by the on-premises Exchange server and once by the cloud.

Incoming Email

Email Gateway receives two journal messages: the first is journaled by the On-Prem Exchange server and the second by Exchange on the O365 cloud. Both journal messages carry the same message inside. The first journal body lists the Cloud user as TO, and the second journal body lists the On-Prem user as the recipient.

Example

Same message received twice due to Microsoft Journal behavior.

images/download/attachments/138521568/Scr_02.png

NearDedupe: NO

NearDedupe: YES

View in Enterprise Archive - eDiscovery Application

More than one near-duplicate document containing the same Subject, Body, etc. is discovered, each as a unique document.

Enterprise Archive Search results containing Near-duplicates:

images/download/attachments/138521568/one.png

View in Enterprise Archive - eDiscovery Application

A single document containing the same Subject, Body, etc. is discovered after near-deduplication.

images/download/attachments/138521568/two.png

Participants Info

Near-duplicate document Journaled from Exchange On-Prem

images/download/attachments/138521568/p1.png

Near-duplicate document journaled from O365, which added the On-Prem user as an additional participant

images/download/attachments/138521568/p2.png

Participants Info

All participants from On-Prem and Cloud Exchange are captured in a single document

images/download/attachments/138521568/p3.png
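One way to picture how participants from both journal copies end up in a single document is a simple order-preserving union. The Python sketch below is only a model of that outcome: the `merge_participants` helper, the dictionary layout, and the addresses are hypothetical and are not taken from Enterprise Archive.

```python
def merge_participants(journal_copies):
    """Union of recipients across copies, keeping first-seen order."""
    merged = []
    seen = set()
    for copy in journal_copies:
        for address in copy["recipients"]:
            key = address.lower()  # compare addresses case-insensitively
            if key not in seen:
                seen.add(key)
                merged.append(address)
    return merged


# The on-premises journal names the cloud user; the cloud journal names
# the on-premises user. Both wrap the same original message.
onprem_journal = {"source": "On-Prem", "recipients": ["cloud.user@example.com"]}
cloud_journal = {"source": "O365", "recipients": ["onprem.user@example.com",
                                                  "Cloud.User@example.com"]}

print(merge_participants([onprem_journal, cloud_journal]))
# ['cloud.user@example.com', 'onprem.user@example.com']
```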

GCIDs

If NearDedupe=NO, Email Gateway creates a unique GCID for each email, so they are discovered as two separate emails in Enterprise Archive.

NearDedupe message#1 with unique GCID

images/download/attachments/138521568/GCID1.png

NearDedupe message#2 with unique GCID

images/download/attachments/138521568/GCID2.png

GCIDs

If NearDedupe=YES, Email Gateway creates the same GCID for each email, so they are discovered as a single email in Enterprise Archive.

Message deduped in Enterprise Archive with the same GCID for all documents

images/download/attachments/138521568/GCID3.png

Scenario-3: Cloud to On-premises

When a message is sent by an Exchange Cloud user to a user in Exchange On-Prem, Microsoft journals the message twice: once from the On-Prem Exchange server and once from the Cloud Exchange server.

Proofpoint URL Defense (BETA)

Deduplication in Proofpoint URL Defense is available as BETA only on request.

Proofpoint URL Defense, also known as Targeted Attack Protection (TAP) URL Defense, protects against URL-based email threats, including malware and credential phishing.

NearDedupe: NO

NearDedupe: YES

With NearDedupe set to NO, the checksum in Enterprise Archive is recorded under a custom attribute, not in the GCID. The GCID becomes a random unique ID that Email Gateway constructs using:

  • FROM

  • TO

  • CC

  • BCC

  • Subject

  • Plain Content

  • HTML Content

  • Attachments

  • messageID from the email inside the journal envelope (P2 email)

  • Email Date from the email inside the journal envelope (P2 email)

With NearDedupe set to YES, the checksum in Enterprise Archive is recorded under a custom attribute, not in the GCID. The GCID becomes a random unique ID that Email Gateway constructs using:

  • FROM

  • Subject

  • messageID from the email inside the journal envelope (P2 email)

  • Email Date from the email inside the journal envelope (P2 email)

Splitting behavior

  • For each local recipient of the message, TAP generates a per-user message (duplicate) with all internal URL modifications, etc.

    • The per-user message contains the X-URLDefense header with the value duplicate.

    • The per-user message is sent to a corresponding local recipient

    • Contains

      • X-URLDefense-Signature header value containing the signature of X-URLDefense and Message-ID.

      • X-Proofpoint-ORIG-GUID and X-Proofpoint-GUID value containing a unique checksum for each duplicate message

  • One canonical message is generated for the received message. The canonical message is the original message with modifications to headers. The canonical message contains:

    • X-URLDefense header with value canonical.

    • X-URLDefense-Bcc header with a value containing all additional recipients not specified in To: or Cc:

    • X-URLDefense-Signature header with a value containing a signature of the X-URLDefense, X-URLDefense-Bcc and Message-ID.

    • The message is sent to the Proofpoint "nominated" inbox.
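The header values described above can be inspected with Python's standard `email` module. The sketch below tells per-user (duplicate) copies from the canonical copy by the X-URLDefense header and reads the additional recipients from X-URLDefense-Bcc. The helper names and sample messages are hypothetical, and the comma-separated format assumed for X-URLDefense-Bcc is a guess made for illustration.

```python
from email import message_from_string
from email.utils import getaddresses


def url_defense_kind(raw_message):
    """Return the X-URLDefense value ('duplicate', 'canonical', or '')."""
    msg = message_from_string(raw_message)
    return (msg.get("X-URLDefense") or "").strip().lower()


def additional_recipients(raw_message):
    """Addresses listed in X-URLDefense-Bcc (format is an assumption)."""
    msg = message_from_string(raw_message)
    values = msg.get_all("X-URLDefense-Bcc", [])
    return [address for _, address in getaddresses(values)]


per_user = (
    "Message-ID: <m1@example.com>\n"
    "X-URLDefense: duplicate\n"
    "To: alex@example.com\n"
    "Subject: Hi\n"
    "\n"
    "Body with a rewritten URL\n"
)
canonical = (
    "Message-ID: <m1@example.com>\n"
    "X-URLDefense: canonical\n"
    "X-URLDefense-Bcc: diego@example.com, sam@example.com\n"
    "Subject: Hi\n"
    "\n"
    "Original body\n"
)

print(url_defense_kind(per_user))        # duplicate
print(url_defense_kind(canonical))       # canonical
print(additional_recipients(canonical))  # ['diego@example.com', 'sam@example.com']
```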

Note

The Canonical copies are not sent to Smarsh for archival. Email Gateway receives only the duplicate copies and those are the ones that will be NearDeduplicated in Enterprise Archive.

images/download/attachments/138521568/MailfromPP.png

NearDedupe: NO

NearDedupe: YES

View in Enterprise Archive - eDiscovery Application

Previously, Email Gateway also included the Body content when calculating the checksum for the GCID, which made each of these emails unique and caused duplicates in Enterprise Archive.
images/download/attachments/138521568/PPMail_OLDMechanism.png

View in Enterprise Archive - eDiscovery Application

Now Email Gateway considers only the Subject, From, MessageID, and Date fields to NearDedupe the emails filtered from Proofpoint URL Defense, thereby displaying only one email in Enterprise Archive.

images/download/attachments/138521568/PPMail_NewMechanism.png