Viewing Content Duplicates

Overview

Content Duplicates are blocks of content that either match completely or are appreciably similar to one another. Duplicates are displayed as a count in the Related Items column when a search is performed across the application. Representing content duplicates this way helps reviewers and supervisors when they review copies of the same archived document.

Note

Content Duplicate is currently applicable only for Email data ingested through Email Gateway (EGW).
Email Gateway checks the uniqueness of an incoming email using headers such as From, To, Cc, and Bcc, and uses the email's metadata to calculate a checksum, which is used to construct a Global Communication ID (GCID). Enterprise Archive uses the GCID received from Email Gateway to display the search results and suppress the duplicate email copies.

There are two types of Content Duplicates:

Exact Duplicates

Two emails are exact duplicates when they have exactly the same headers (From, To, Cc, Bcc, and any X-header), attachments, body content, and size, that is, a 100 percent match between the two copies.

EGW deduplicates exact duplicate records by constructing the same GCID for both emails, because their data matches 100 percent. When a search is performed in Enterprise Archive for such records, all the documents that have any related items are displayed along with the count of duplicate documents.
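The exact-duplicate rule can be illustrated with a minimal Python sketch. The checksum algorithm and the exact fields that EGW uses are not specified in this document, so the function name `sketch_gcid`, the use of SHA-256, and the field list below are assumptions made only for illustration. This is not the real GCID calculation.

```python
import hashlib

# Hypothetical field list; the real EGW inputs are not specified in this document.
HEADER_FIELDS = ("From", "To", "Cc", "Bcc")


def sketch_gcid(email: dict) -> str:
    """Return an illustrative checksum-based ID for an email (not the real GCID)."""
    digest = hashlib.sha256()
    for field in HEADER_FIELDS:
        value = email.get("headers", {}).get(field, "")
        digest.update(f"{field}:{value.strip().lower()}\n".encode("utf-8"))
    digest.update(email.get("body", "").encode("utf-8"))
    for attachment in email.get("attachments", []):
        digest.update(hashlib.sha256(attachment).digest())
    return digest.hexdigest()


original = {
    "headers": {"From": "a@example.com", "To": "b@example.com"},
    "body": "Quarterly report attached.",
    "attachments": [b"report-bytes"],
}
# An exact copy: same headers, body, and attachments.
exact_copy = {
    "headers": dict(original["headers"]),
    "body": original["body"],
    "attachments": list(original["attachments"]),
}
# Same content, but one header differs.
changed_copy = {**exact_copy, "headers": {"From": "a@example.com", "To": "c@example.com"}}

same_id = sketch_gcid(original) == sketch_gcid(exact_copy)
different_id = sketch_gcid(original) != sketch_gcid(changed_copy)
```

In this sketch, the two exact copies produce the same ID, while the copy with a changed To header produces a different one.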

Near Duplicates

Two emails are near-duplicates when they have the same body content and attachments but different headers (From, To, Cc, Bcc, or any X-header), or only minor differences between the headers.

Email Gateway relies on the metadata from the original emails (as described for exact duplicates) to deduplicate near-duplicate records. The near-deduplication feature can be turned on or off using the NearDedupe configuration available in EGW. Contact Smarsh Support to enable this configuration.

Note

By default, NearDedupe is enabled, which means deduplication is applied to all emails ingested into Enterprise Archive from EGW.

EGW does the following when the NearDedupe configuration is turned on or off:

  • ON: Any two near-duplicate emails with the same GCID are deduplicated in the Enterprise Archive search results.

  • OFF: Any two near-duplicate emails with unique GCIDs are not deduplicated in the Enterprise Archive search results.

For more information, see NearDeduplication Behavior.
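The ON and OFF behavior can be pictured with a small Python sketch. How EGW actually derives the GCID in each mode is not described here. The sketch assumes, purely for illustration, that the ID is built from content only when the setting is on, and from headers plus content when it is off. The names `sketch_id` and `group_by_id` are hypothetical.

```python
import hashlib
from collections import defaultdict


def sketch_id(email: dict, near_dedupe: bool) -> str:
    """Illustrative ID: content only when near_dedupe is on, headers plus content when off."""
    parts = [email["body"]]
    if not near_dedupe:
        # Assumption: with the setting off, header differences produce distinct IDs.
        parts += [f"{key}:{value}" for key, value in sorted(email["headers"].items())]
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


def group_by_id(emails: list, near_dedupe: bool) -> list:
    """Group email names by their illustrative ID, preserving input order."""
    groups = defaultdict(list)
    for email in emails:
        groups[sketch_id(email, near_dedupe)].append(email["name"])
    return list(groups.values())


# Two near-duplicates: same body, different To header.
emails = [
    {"name": "copy-1", "headers": {"To": "b@example.com"}, "body": "Same text"},
    {"name": "copy-2", "headers": {"To": "c@example.com"}, "body": "Same text"},
]

groups_on = group_by_id(emails, near_dedupe=True)    # both copies share one ID
groups_off = group_by_id(emails, near_dedupe=False)  # each copy has its own ID
```

With the setting on, both copies fall into one group and can be treated as duplicates of each other. With it off, each copy stands alone.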

NearDupe Versioning

There are three versions of NearDedupe: v1, v2, and v3. Version v3 is the latest and most advanced version, offering enhanced features such as saving and viewing duplicate content, and bulk tagging. These features allow users to review and manage near-duplicate content efficiently. If you are using an older version of NearDupe (v1 or v2), you may not have access to these enhanced features. To upgrade to NearDupe v3 and benefit from the latest enhancements, contact Smarsh Support. Once enabled, the enhancement applies exclusively to new messages processed with NearDedupe v3.

The following table compares the features available in different NearDupe versions:

Feature | Available in v1 | Available in v2 | Available in v3
Saving and viewing duplicate content | No | No | Yes
Review items individually | No | No | Yes
Individual item export | No | No | Yes
Bulk tagging | No | No | Yes

Viewing Content Duplicates for NearDupe V3

Content Duplication is applied across all three applications: Archive Management, Case Management, and Conduct. Duplicate documents are displayed as a count in the Related Items column on the search results page. The number represents the total number of duplicate copies of the archived document.

Important

Search results will display all the documents including the duplicate documents, along with the count in the Related Items column.

For example, if three duplicate documents are identified, search results display all three documents along with the count (images/download/attachments/138520180/RelatedItems_Count.jpg) in the Related Items column, which indicates that three duplicate copies of the same document are archived in Enterprise Archive.


images/download/attachments/138520180/RelatedItems.jpg
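As a rough illustration of how such a count relates to the result list, the following Python sketch attaches a per-GCID copy count to every row without removing any row. The result structure, the `gcid` placeholder values, and the choice to show no count for documents without duplicates are assumptions, not the application's actual data model.

```python
from collections import Counter

# Hypothetical search results; the "gcid" values are placeholders.
results = [
    {"doc": "email-1", "gcid": "G-100"},
    {"doc": "email-2", "gcid": "G-100"},
    {"doc": "email-3", "gcid": "G-100"},
    {"doc": "email-4", "gcid": "G-200"},
]

copies_per_gcid = Counter(row["gcid"] for row in results)

# Every document stays in the result list; only a count is attached.
# Assumption: documents without duplicates show no count (None here).
rows = []
for row in results:
    copies = copies_per_gcid[row["gcid"]]
    rows.append({**row, "related_items": copies if copies > 1 else None})
```

Here the three documents that share a GCID each carry a count of 3, and the fourth document carries none.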

Clicking the number under Related Items opens a Content Duplicates tab, where all the duplicate copies of that document are displayed. You may perform an export or any review actions from this tab.

images/download/attachments/138520180/ContentDuplicates.jpg

Clicking a document in the Content Duplicates tab opens that document in View 2, and the Content Duplicates tab moves to the left. This enables you to view the other duplicate documents easily.

images/download/attachments/138520180/View_DuplicateDoc.jpg

You can also perform bulk tagging of content duplicates. For more information, see Reviewing Content in Enterprise Archive.

Exporting NearDupe Contents

When exporting items that have Related Items associated with them, the related items are also exported. For example, if you export a document with five Related Items, the document and the five related items will be exported.
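The export rule above can be sketched in Python as follows. The function `expand_for_export` and the dictionary that maps a document to its related items are hypothetical and only illustrate the behavior described. They are not part of any Enterprise Archive API.

```python
def expand_for_export(selected: list, related: dict) -> list:
    """Return the selected documents followed by their related items, without repeats."""
    exported = []
    seen = set()
    for doc in selected:
        for item in [doc, *related.get(doc, [])]:
            if item not in seen:
                seen.add(item)
                exported.append(item)
    return exported


# One document with five related items, as in the example above.
related_items = {"email-1": ["dup-1", "dup-2", "dup-3", "dup-4", "dup-5"]}
export_list = expand_for_export(["email-1"], related_items)
```

For one selected document with five related items, the resulting export list contains six entries: the document itself and its five related items.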

Note

The Remove Duplicate Emails option is currently unavailable for NearDupe V3 when exporting items with Related Items. This option is available only for V1 and V2.