Viewing Content Duplicates
Overview
Content duplicates are blocks of content that either match completely or are appreciably similar to one another. They are displayed as a count in the Related Items column when a search is performed across the application. Representing content duplicates this way helps reviewers and supervisors review multiple copies of the same archived document.
There are two types of Content Duplicates:
Exact Duplicates
Two emails are exact duplicates when they have the same headers (From, To, Cc, Bcc, or any X-header), attachments, body content, and size, with a 100 percent match between the two copies.
Email Gateway (EGW) deduplicates exact duplicate records by constructing the same GCID for both emails, because their data matches 100 percent. When a search is performed in Enterprise Archive for such records, all documents that have related items are displayed along with the count of duplicate documents.
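How EGW actually builds a GCID is not described here. As a rough illustration of the idea only, the sketch below derives one key from every field that must match for an exact duplicate. The function name `exact_duplicate_key`, the dictionary layout, and the use of SHA-256 are all assumptions made for this example, not the product's implementation.

```python
import hashlib


def exact_duplicate_key(email: dict) -> str:
    """Hypothetical stand-in for a GCID; not the actual EGW algorithm."""
    digest = hashlib.sha256()

    def add(part: bytes) -> None:
        # Length-prefix each part so adjacent fields cannot blur together.
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)

    # Every header takes part, sorted so header order does not matter.
    for name, value in sorted(email["headers"].items()):
        add(f"{name.lower()}:{value}".encode("utf-8"))
    add(email["body"].encode("utf-8"))
    for attachment in email["attachments"]:  # raw attachment bytes
        add(attachment)
    return digest.hexdigest()


original = {
    "headers": {"From": "a@example.com", "To": "b@example.com"},
    "body": "Q3 report attached.",
    "attachments": [b"report-bytes"],
}
same_copy = {
    "headers": {"To": "b@example.com", "From": "a@example.com"},
    "body": "Q3 report attached.",
    "attachments": [b"report-bytes"],
}
# Same content, but one extra header: no longer an exact duplicate.
extra_bcc = {**original, "headers": {**original["headers"], "Bcc": "c@example.com"}}
```

In this sketch `original` and `same_copy` produce the same key, while `extra_bcc` produces a different one because a single header differs.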
Near Duplicates
Two emails are near duplicates when they have the same body content and attachments but different headers (From, To, Cc, Bcc, or any X-header), or only minor differences between their headers.
Email Gateway relies on the metadata from the original emails (as described for exact duplicates) to deduplicate near-duplicate records. Near deduplication can be turned on or off using the NearDedupe configuration available in EGW. Contact Smarsh Support to enable this configuration.
Note
By default, NearDedupe is enabled, which means deduplication is applied to all emails ingested into Enterprise Archive from EGW.
EGW behaves as follows depending on whether the NearDedupe configuration is turned on or off:
ON: Near-duplicate emails share the same GCID and are deduplicated in the Enterprise Archive search results.
OFF: Near-duplicate emails have unique GCIDs, so each copy appears separately in the Enterprise Archive search results.
For more information, see NearDeduplication Behavior.
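The ON and OFF behavior described above can be pictured with a small sketch. It assumes, purely for illustration, that a key is computed from body and attachments only when the setting is on, and from headers as well when it is off. The function `dedupe_key`, its `near_dedupe` parameter, and the hashing scheme are hypothetical and do not represent how EGW computes a GCID.

```python
import hashlib


def _digest(parts: list) -> str:
    h = hashlib.sha256()
    for part in parts:
        h.update(len(part).to_bytes(8, "big"))  # keep part boundaries unambiguous
        h.update(part)
    return h.hexdigest()


def dedupe_key(email: dict, near_dedupe: bool = True) -> str:
    """Hypothetical key: content only when near_dedupe is on, else headers too."""
    content = [email["body"].encode("utf-8"), *email["attachments"]]
    if near_dedupe:
        # ON: headers are ignored, so near duplicates share one key.
        return _digest(content)
    # OFF: headers take part, so near duplicates get unique keys.
    headers = [
        f"{name.lower()}:{value}".encode("utf-8")
        for name, value in sorted(email["headers"].items())
    ]
    return _digest(headers + content)


to_alice = {
    "headers": {"From": "a@example.com", "To": "alice@example.com"},
    "body": "Policy update",
    "attachments": [b"policy-bytes"],
}
to_bob = {**to_alice, "headers": {"From": "a@example.com", "To": "bob@example.com"}}
```

With `near_dedupe=True`, `to_alice` and `to_bob` map to one key; with `near_dedupe=False`, they map to two different keys.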
NearDupe Versioning
There are three versions of NearDedupe: v1, v2, and v3. Version v3 is the latest and most advanced, offering enhanced features such as saving and viewing duplicate content and bulk tagging, which allow users to review and manage near-duplicate content efficiently. If you are using an older version (v1 or v2), you may not have access to these enhanced features. To upgrade to v3 and benefit from the latest enhancements, contact Smarsh Support.
Once enabled, the enhancement applies exclusively to new messages processed with NearDedupe v3.
The following table compares the features available in different NearDupe versions:
Feature | Available in v1 | Available in v2 | Available in v3
Saving and viewing duplicate content | No | No | Yes
Review items individually | No | No | Yes
Individual item export | No | No | Yes
Bulk tagging | No | No | Yes
Viewing Content Duplicates for NearDupe V3
Content duplication is applied across all three applications: Archive Management, Case Management, and Conduct. Duplicate documents are displayed as a count in the Related Items column on the search results page. The number represents the total number of duplicate copies of the archived document.
Important
Search results will display all the documents including the duplicate documents, along with the count in the Related Items column.
For example, if three duplicate documents are identified, the search results display all three documents along with the count in the Related Items column, which indicates that there are three duplicate copies of the same document archived in Enterprise Archive.
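The counting in this example can be sketched as a simple group-and-count over search results. The sketch assumes each result carries a `gcid` field and that the Related Items value is the total number of copies sharing that GCID. The field names and the function `with_related_counts` are hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter


def with_related_counts(results: list) -> list:
    """Attach a hypothetical related_items count to every search result."""
    # Count how many results share each GCID.
    counts = Counter(doc["gcid"] for doc in results)
    # Every document stays in the list; only the count is added.
    return [{**doc, "related_items": counts[doc["gcid"]]} for doc in results]


results = [
    {"id": 1, "gcid": "abc"},
    {"id": 2, "gcid": "abc"},
    {"id": 3, "gcid": "abc"},
    {"id": 4, "gcid": "xyz"},
]
annotated = with_related_counts(results)
```

Under these assumptions, the three documents that share a GCID each show a count of 3, and all of them remain in the result list.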
Clicking the number under Related Items opens a Content Duplicates tab, where all the duplicate copies of that document are displayed. You may perform an export or any review actions from this tab.
Clicking a document in the Content Duplicates tab opens the document in View 2, and the Content Duplicates tab moves to the left. This lets you view the other duplicate documents easily.
You can also perform bulk tagging of content duplicates. For more information, see Reviewing Content in Enterprise Archive.
Exporting NearDupe Contents
When exporting items that have Related Items associated with them, the related items are also exported. For example, if you export a document with five Related Items, the document and the five related items will be exported.