WORDSANDPHRASES
Introduction
WORDSANDPHRASES consists of sections that determine whether particular content must or must not exist before a policy will trigger an alert. Those sections are MUSTANY, MUSTALL, and MUSTNOT (collectively, MUSTx sections). MUSTx sections can be used individually or in conjunction with each other. Each MUSTx section contains a ZONES field.
ZONES fields in turn consist of TERMS and LEXICONS_LIST subfields that contain the criteria used to analyze the content of communications. Content is analyzed by matching the content of a communication against the Entries present in the subfields of the ZONES field.
For the TERMS subfield, Entries are embedded directly in the JSON policy itself; for the LEXICONS_LIST subfield, a reference embedded in the JSON policy points to an external text file that contains the Entries. Multiple lexicons can be used.
If content does not need to be analyzed, the WORDSANDPHRASES section is not included in a policy; instead, the FILTERS section is used to analyze contextual components of a communication. WORDSANDPHRASES can be used in conjunction with FILTERS.
ENTRIES
Entries are individual lines found in the subfields of ZONES and/or in a lexicon referenced by a LEXICONS_LIST operator. Each Entry consists of one or more Terms, each of which can be a Simple Term or a Complex Term.
TERMS
Terms are strings of characters that can be either Simple or Complex. Entries are comprised of Terms.
Simple Terms
Searches using Entries that consist of a list of words and phrases are referred to as simple keyword searches.
Example
Term | Results
gift | Searches will match only the word gift, anywhere in the document.
gift for you | Searches will match only the exact phrase gift for you, anywhere in the communication or its attachments.
Complex Terms
Complex Terms go beyond simple lists of words and phrases, using a variety of contextual syntax characters and operators to provide more precise matching capabilities.
Complex Terms | Example | Result
Wildcards
Search for multiple variations of a keyword using wildcard symbols.
Example: gift*
Result: Searches will match gift, gifts, gifted, gifting, or any word beginning with "gift" throughout the document.
Example: str?p
Result: Searches will match only five-letter words that start with "str" and end with "p", such as strap, strep, strip, and strop.
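The wildcard behavior described above can be approximated outside the product for quick experiments. The following is a minimal Python sketch, not the product's implementation. It assumes that "*" stands for zero or more word characters and "?" for exactly one, and the function names are hypothetical.

```python
import re

def wildcard_to_regex(term):
    """Translate a wildcard term into a compiled, case-insensitive regex.

    Assumption: '*' means zero or more word characters and '?' means
    exactly one word character.
    """
    parts = []
    for ch in term:
        if ch == "*":
            parts.append(r"\w*")
        elif ch == "?":
            parts.append(r"\w")
        else:
            parts.append(re.escape(ch))
    # Word boundaries keep "str?p" from matching inside "stripe".
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)

def wildcard_matches(term, text):
    """Return every word or phrase in text that the wildcard term matches."""
    return wildcard_to_regex(term).findall(text)
```

For instance, wildcard_matches("str?p", "strap strip stripe stop") returns only "strap" and "strip", because "?" consumes exactly one character.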
Regular Expressions
Search for a sequence of symbols or characters expressing a string or a pattern (regular expressions), such as Social Security numbers, credit card numbers, phone numbers, dates, IP addresses, and so on.
Example: /^(?!000|666)[0-9]{3}([ -]?)(?!00)[0-9]{2}\1(?!0000)[0-9]{4}$/
Result: Searches look for Social Security numbers in the format ###-##-#### or ### ## ####, where each # stands for a digit, the first three digits are not 000 or 666, the middle two digits are not 00, and the last four digits are not 0000. Because the separator is optional, nine digits with no separator also match. Searches will match numbers such as 144-94-8875 or 882-47-3337.
Example: Thank? for my /gift.|present./
Result: Searches will match phrases such as:
It will NOT match "Thank you for my gifts" as "you" is not accounted for.
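A pattern such as the Social Security number expression above can be tried out before it is placed in a policy. The following Python sketch copies the pattern without its surrounding slashes. It assumes that Python's re module treats the expression the same way the policy engine does, which is not confirmed by this document, and looks_like_ssn is a hypothetical helper name.

```python
import re

# Pattern copied from the example above, minus the enclosing slashes.
SSN_PATTERN = re.compile(
    r"^(?!000|666)[0-9]{3}([ -]?)(?!00)[0-9]{2}\1(?!0000)[0-9]{4}$"
)

def looks_like_ssn(candidate):
    """Return True if the whole candidate string fits the pattern."""
    return SSN_PATTERN.match(candidate) is not None
```

The back-reference \1 forces the second separator to equal the first, so a mixed form such as "144-94 8875" is rejected.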
Fuzzy Logic
Fuzzy search finds words that are close, but not identical, to the search term. It is requested by appending a tilde (~) to a term. Closeness is measured as the minimum number of edits needed to transform one string into another, where the allowable edit operations are the insertion, deletion, or substitution of a single character. The tilde has no effect on terms of two or fewer characters. For terms of three to five characters, the default maximum number of edits allowed is 1; for terms of more than five characters, the maximum is 2. So "substitution~" will match "substitutoin" because the "o" is deleted in "substitutoin" and then reinserted before the "n", that is, two edits. Although the number of edits allowed can be configured, it is recommended that the default value be used.
Example: free lunch~
Result: Searches will match on phrases such as:
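The edit-counting rule described above corresponds to the classic Levenshtein distance. The following Python sketch is offered as an illustration under that assumption and is not the product's matcher. The thresholds in default_max_edits restate the defaults given in the text, and all function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

def default_max_edits(term):
    """Default edit budget by term length, as stated in the text above."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2

def fuzzy_match(term, word):
    """True if word is within the default edit budget of term."""
    return edit_distance(term.lower(), word.lower()) <= default_max_edits(term)
```

Under this sketch, fuzzy_match("return", "retrun") holds (two edits), while fuzzy_match("return", "returning") does not (three edits).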
Proximity Operators
Proximity operators are special syntax words that combine simple and/or complex terms to allow for greater accuracy of content analysis by looking for the terms within a specific distance of each other. The operator word is followed by a comma and a number that represents the number of offsets that can exist between the terms. In addition to the descriptions of the operators below, also see Limitations in a Proximity Search.
FOLLOWEDBY
Signifies left-to-right directionality where the first Term must be followed by the second Term within the specified distance. Term1 FOLLOWEDBY,4 Term2 will match on the content of a communication so long as there are no more than four offsets between the first Term and the second Term. In other words, there can be 0-4 words between the two Terms.
Example: guarantee FOLLOWEDBY,4 return*
Result: Examples that will hit:
Example that will not hit:
Example: thank* FOLLOWEDBY,3 gift
Result: Searches will match on phrases such as:
Because the proximity distance is limited to three words, searches will NOT match on phrases such as:
Increasing the value of the number following "FOLLOWEDBY" will allow such phrases to be captured. Note: "thank FOLLOWEDBY,3 gift" will produce the same results as "gift PRECEDEDBY,3 thank".
NOTFOLLOWEDBY
Signifies left-to-right directionality where the first Term must not be followed by the second Term within the specified distance.
Example: thank* NOTFOLLOWEDBY,3 gift
Result: Searches will match on words and phrases such as:
Searches will not match if "thank*" is followed, within zero to three words, by "gift". Consequently,
Note: "thank NOTFOLLOWEDBY,3 gift" will produce the same results as "gift NOTPRECEDEDBY,3 thank".
PRECEDEDBY
Signifies right-to-left directionality where the first Term must be preceded by the second Term within the specified distance.
Example: guarantee* PRECEDEDBY,6 return~
Result: Searches will match on phrases such as:
Note the use of the tilde here for fuzzy matching. Because of the length of the root word, up to two edits can be made to constitute a match; therefore, the word "guarantee*" will match so long as no more than six words are between it and any of the following: return, returns, returned, riturn, retrun, etc. It will NOT match on "returning" because that would require three edits. To be able to match on "returning", the entry would have to be written as either:
NOTPRECEDEDBY
Signifies right-to-left directionality where the first Term must not be preceded by the second Term within the specified distance.
Example: golf outing NOTPRECEDEDBY,2 company sponsored
Result: Searches will match on phrases such as:
Searches will not match on:
NEAR
Signifies bi-directionality where the first Term must appear either before or after the second Term within the specified distance.
Example: guarantee* NEAR,3 return*
Result: Searches will match on:
NOTNEAR
Signifies bi-directionality where the first Term must not appear either before or after the second Term within the specified distance.
Example: guarantee* NOTNEAR,3 return*
Result: Searches will match on phrases such as:
Searches will not match on phrases such as:
Multiple Terms can be used on either or both sides of a syntax operator; they are separated by a vertical pipe character "|", which is shorthand for a logical OR.
Example: thank* | amazing FOLLOWEDBY,5 gift | present*
The literal translation is: "thank*" or "amazing" followed by "gift" or "present*", so long as there are no more than five words between the former and the latter. Searches will match the following phrases:
Proximity operators can be nested; however, parentheses must be used to group each level to avoid confusion. Operators are processed linearly left to right.
Example: ((brown fox FOLLOWEDBY,2 jumps) NOTFOLLOWEDBY,3 lazy horse) NOTPRECEDEDBY,4 slow
The above example first looks to see if the phrase "brown fox" is followed within two words by the word "jumps". It next makes sure that those words are not followed by "lazy horse", then ensures that the words are also not preceded by the word "slow". The above example will trigger on
The above example will NOT trigger on
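The distance rule for the positive proximity operators (FOLLOWEDBY, PRECEDEDBY, NEAR) can be illustrated with a small Python sketch. This is a simplified model and not the product's engine. It assumes single-word terms, whitespace-style tokenization, and that a distance of N allows 0 to N words between the two terms, as described for FOLLOWEDBY. All function names are hypothetical.

```python
import re
from fnmatch import fnmatchcase

def tokenize(text):
    """Split text into lowercase word tokens (an assumed tokenization)."""
    return re.findall(r"[\w']+", text.lower())

def positions(tokens, term):
    """Indexes of tokens matching a single-word term ('*' and '?' allowed)."""
    return [i for i, tok in enumerate(tokens) if fnmatchcase(tok, term.lower())]

def followed_by(text, first, second, distance):
    """True if second occurs after first with at most `distance` words between."""
    tokens = tokenize(text)
    seconds = positions(tokens, second)
    return any(0 < j - i <= distance + 1
               for i in positions(tokens, first) for j in seconds)

def preceded_by(text, first, second, distance):
    """Mirror image of followed_by: first must come after second."""
    return followed_by(text, second, first, distance)

def near(text, first, second, distance):
    """True if the terms occur within range in either order."""
    return (followed_by(text, first, second, distance)
            or followed_by(text, second, first, distance))
```

In this model, preceded_by is defined by swapping the arguments of followed_by, which mirrors the note that "thank FOLLOWEDBY,3 gift" and "gift PRECEDEDBY,3 thank" produce the same results.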
Excluding
The EXCLUDING operator provides an exception to a term that will prevent an alert from being triggered.
Example: out of the money EXCLUDING out of the money jar | finished out of the money
Result: Searches will match on phrases such as:
Searches will not match on a phrase such as:
Lucene Syntax
Lucene is a search engine library written in Java. Lucene syntax differs from that used by Elasticsearch, and the two cannot be combined in a JSON file or a lexicon. Smarsh recommends that the use of Lucene syntax be restricted solely to policies created in the Advanced query section of the UI.
Use of TERMS
The following is an example of how different Terms can be embedded in a policy. For a further explanation of the MUSTANY and ZONES operators in the example, see the MUSTx and ZONES sections respectively.
{
  "WORDSANDPHRASES": {
    "MUSTANY": {
      "ZONES": [ "Subject", "Body", "File" ],
      "TERMS": [
        "strap",
        "str?p",
        "stock~",
        "stock guarantee",
        "stock guarant*",
        "stock FOLLOWEDBY,2 guarant*",
        "guarant* PRECEDEDBY,2 stock",
        "(stock FOLLOWEDBY,2 guarantee) PRECEDEDBY,2 i | we"
      ]
    }
  }
}
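Because every Entry in a TERMS array must be a quoted string separated from its neighbors by a comma, a quick syntax check before upload can catch a dropped comma or brace. The following Python sketch uses only the standard json module. The function name extract_terms is hypothetical, and the check covers JSON syntax only, not the validity of the Terms themselves.

```python
import json

def extract_terms(policy_text):
    """Parse a policy and return {section_name: [terms]} for WORDSANDPHRASES.

    json.loads raises json.JSONDecodeError (a ValueError) when the text
    is not well-formed JSON, for example when a comma is missing.
    """
    policy = json.loads(policy_text)
    sections = policy.get("WORDSANDPHRASES", {})
    return {name: body.get("TERMS", []) for name, body in sections.items()}
```

A policy in which two Terms are not separated by a comma fails at the json.loads step rather than being silently misread.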
LEXICONS_LIST
If the list of Entries in the TERMS subfield is extensive, one can use a LEXICONS_LIST in lieu of listing each Entry individually.
When the LEXICONS_LIST operator is used, the name of an external lexicon, rather than the Entries themselves, is embedded in the JSON. The lexicon is simply a text file that contains a series of Entries, each listed as it would otherwise appear in the TERMS field if the Entries were embedded in the JSON policy. In technical terms, the names of the lexicons are provided as an array to the LEXICONS_LIST subfield of a MUSTx section.
For a further explanation of the MUSTANY operator in the following example, see MUSTx Sections. Lexicons can be referenced by more than one policy.
IMPORTANT
The name of the lexicon file and its reference in the JSON file must be identical. Case sensitivity, spaces, and underscores are all applied literally. Because it is easy to mistakenly use two spaces between words instead of one, best practices dictate using an underscore to separate the words in the lexicon's name, for example, Lexicon_Gifts_and_Entertainment. To distinguish between policies and lexicons, it is suggested that all lexicon names begin with a Lexicon prefix, as in the example.
LEXICONS_LIST Prequalifiers
Occasionally there is a need to use more than one LEXICONS_LIST in a single policy. The most common scenario in which more than one list is used is to create a "prequalifier" that looks to see if an entry on the first list is present before it moves to the next list to check for a match. If there is no match on the first list, processing ceases and the system moves to the next communication. Common policies that use prequalifiers are Gifts & Entertainment and Fair & Balanced communications. Prequalifiers help reduce false positive "noise" and improve policy accuracy. MINMATCH can also be used to help reduce the number of hits on a policy, but there are important caveats involved in doing so.
IMPORTANT
If someone edits a policy after it has been uploaded to change the name of a LEXICONS_LIST, and the name of the list no longer matches the name of the lexicon actually uploaded in the list library, no error message will be displayed to indicate that a mismatch exists; however, the policy will not run and the entire queue in which it is placed could be disabled.
Prior to making any change to the name of a LEXICONS_LIST, confirm the following points:
The name of the LEXICONS_LIST is changed;
The internal reference in every other policy that refers to that lexicon is changed as well.
Example - Using Lexicon List
{
  "WORDSANDPHRASES": {
    "MUSTANY": {
      "ZONES": [ "Subject", "Body", "File" ],
      "LEXICONS_LIST": [
        "Name_of_Lexicon_List"
      ]
    }
  }
}
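Since a mismatch between a LEXICONS_LIST reference and the uploaded lexicon name produces no error message, it may be worth checking names before upload. The following Python sketch walks a parsed policy and reports references that have no exact, case-sensitive counterpart. The function names and the idea of passing in a collection of uploaded names are assumptions made for illustration, not product features.

```python
def referenced_lexicons(node):
    """Collect every name listed under any LEXICONS_LIST key in a parsed policy."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "LEXICONS_LIST":
                found.extend(value)
            else:
                found.extend(referenced_lexicons(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(referenced_lexicons(item))
    return found

def missing_lexicons(policy, uploaded_names):
    """Return referenced names with no exact (case-sensitive) uploaded match."""
    uploaded = set(uploaded_names)
    return [name for name in referenced_lexicons(policy) if name not in uploaded]
```

The comparison is deliberately literal, so a name that differs only in case, spacing, or underscores is reported as missing.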
MUSTx Sections
The MUSTx criteria determine whether or not the TERMS of an Entry must or must not appear in a communication, as well as with what frequency. The WORDSANDPHRASES section must contain at least one of the following occurrence MUSTx criteria:
MUSTANY | This criterion implies that AT LEAST ONE of the entries in the WORDSANDPHRASES subsections must be present in an analyzed communication. MINMATCH can be used to increase the minimum number of entries that must be present before a policy will trigger.
MUSTALL | This criterion implies that ALL of the entries inside a WORDSANDPHRASES subsection must be present in any communication. If even one of the MUSTALL entries is missing, the policy will not trigger.
MUSTNOT | This criterion implies that NONE of the entries inside a subsection may be present in a communication. Important: If any MUSTNOT entry is matched, the entire communication will be excluded from analysis unless a MUSTHIT policy is in use in the queue in which the policy is being used.
MUSTHIT | The MUSTHIT operator is used in a standalone ignore policy in which only the MUSTHIT entries (via the direct embedding of terms or via the use of a LEXICONS_LIST) are present. MUSTHIT policies are included in the same queue as flagging policies, but are executed first. If a match on the MUSTHIT policy is found, the matching communication is prevented from being excluded because it is set aside and no exclusionary operators can be applied to it.
If multiple MUSTx sections are used, they are considered to be ANDed. So if a MUSTANY and MUSTNOT are present, both conditions must be satisfied before the policy will trigger.
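The way MUSTANY, MUSTALL, and MUSTNOT combine can be sketched in Python as follows. This is an illustration of the ANDed logic only. It assumes a plain, case-insensitive substring test in place of the real matcher, ignores ZONES, lexicons, and MINMATCH, and uses hypothetical function names.

```python
def entry_present(entry, text):
    """Stand-in matcher (assumption): case-insensitive substring test."""
    return entry.lower() in text.lower()

def policy_triggers(words_and_phrases, text):
    """AND together the outcome of every MUSTx section that is present."""
    checks = []
    if "MUSTANY" in words_and_phrases:
        terms = words_and_phrases["MUSTANY"].get("TERMS", [])
        checks.append(any(entry_present(t, text) for t in terms))
    if "MUSTALL" in words_and_phrases:
        terms = words_and_phrases["MUSTALL"].get("TERMS", [])
        checks.append(all(entry_present(t, text) for t in terms))
    if "MUSTNOT" in words_and_phrases:
        terms = words_and_phrases["MUSTNOT"].get("TERMS", [])
        checks.append(not any(entry_present(t, text) for t in terms))
    return all(checks)
```

With a MUSTANY of "gift" or "present" and a MUSTNOT of "company sponsored", a message that mentions a gift triggers only if it does not also mention "company sponsored".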
MUSTx Criteria Rules
There is no maximum number of Entries that can be present in MUSTANY criteria.
A maximum of 50 entries can be present in MUSTALL criteria.
A maximum of 50 entries can be present in MUSTNOT criteria.
The maximum limit of 50 entries in each of the MUSTALL and MUSTNOT criteria is cumulative. Although users can combine terms lists and lexicons, the combined total of entries cannot exceed 50.
An error message is displayed in Enterprise Archive if the list of terms exceeds 50 for the MUSTALL and/or MUSTNOT criteria.
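The cumulative limit can be checked ahead of upload with a short Python sketch such as the following. It assumes the lexicon contents are available locally as a mapping from lexicon name to a list of entries, which is a hypothetical arrangement for illustration. The limit of 50 and the two affected sections come from the rules above.

```python
MAX_ENTRIES = 50                   # limit stated in the rules above
LIMITED = ("MUSTALL", "MUSTNOT")   # MUSTANY has no maximum

def entry_count(section, lexicons):
    """Embedded TERMS plus the entries of every referenced lexicon."""
    count = len(section.get("TERMS", []))
    for name in section.get("LEXICONS_LIST", []):
        count += len(lexicons.get(name, []))
    return count

def sections_over_limit(words_and_phrases, lexicons):
    """Names of limited sections whose combined entry count exceeds the limit."""
    return [name for name in LIMITED
            if name in words_and_phrases
            and entry_count(words_and_phrases[name], lexicons) > MAX_ENTRIES]
```

A MUSTNOT section with 10 embedded Terms and a 45-entry lexicon would be reported, since the combined total is 55.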
Usage of MUSTx Selections
The following are a few examples of how the MUSTx values for WORDSANDPHRASES might be constructed. The open-close braces { } in the representations below are placeholders for the fields and subfields of the WORDSANDPHRASES ZONES that contain the information that define search criteria. The number of open braces must match the number of closed ones.
Syntax Representation 1 | { "WORDSANDPHRASES": { "MUSTALL": {}, "MUSTNOT": {}, "MUSTANY": {} } }
Syntax Representation 2 | { "WORDSANDPHRASES": { "MUSTNOT": {}, "MUSTANY": {} } }
Syntax Representation 3 | { "WORDSANDPHRASES": { "MUSTALL": {}, "MUSTANY": {} } }
Syntax Representation 4 | { "FILTERS": {}, "WORDSANDPHRASES": { "MUSTANY": {} } }
Syntax Representation 5 | { "WORDSANDPHRASES": { "MUSTANY": {} }, "FILTERS": {} }
ZONES Section
The ZONES field of the WORDSANDPHRASES section specifies to what parts of a communication a search should apply. The absence of a ZONES field implies that the complete set of ZONES will be searched.
The following are the ZONES in which a search is eligible to take place. Each ZONE defines a region of the communication that is a target for matching terms.
Subject – Performs a search for Entries in the communication's Subject.
Body – Performs a search for Entries in the communication's Body.
Policy – Performs a search of Policy Events and Attributes applied to communications.
  Policy Events – Actions taken on Socialite before ingesting the communications into Enterprise Archive.
  Attributes – Communications tagged with Custom Attributes.
Action – Performs a search of Action Events and Attributes applied to communications.
  Action Events – Actions taken on Vantage before ingesting the communications into Enterprise Archive.
  Attributes – Communications tagged with Custom Attributes.
File – Performs a search for Entries in the file attachments within the communication.
System – Performs a search for Entries in the System Generated messages within the communication.
A MUSTx ZONE criterion contains search targets such as attributes of a communication. For example, a policy might search only the subject of a communication, or perhaps the body and attachment.
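The effect of the ZONES field, including the rule that an absent ZONES field means all zones are searched, can be modeled with a few lines of Python. The representation of a communication as a dictionary keyed by zone name is an assumption made purely for illustration, and zone_text is a hypothetical helper.

```python
# Zone names as listed above.
ALL_ZONES = ("Subject", "Body", "Policy", "Action", "File", "System")

def zone_text(communication, section):
    """Return the text that a MUSTx section is allowed to search.

    If the section has no ZONES field, every zone is included.
    """
    zones = section.get("ZONES", ALL_ZONES)
    return "\n".join(communication.get(zone, "") for zone in zones)
```

A section limited to Subject and Body therefore never sees the attachment text, while a section without ZONES does.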
USE OF ZONES
Reference
The colors below are placed for ease of following which sections belong together. The indents on braces and brackets are likewise meant to help understand where sections begin and end. In fact, no carriage returns are necessary when composing a JSON file.
The following sets of examples illustrate how to use ZONES with TERMS and LEXICONS_LIST.
Example 1 shows how the ZONES field with a TERMS subfield is constructed.
{ "ZONES": [ "Subject", "Body", "File" ], "TERMS": [ "hello world", "guarantee*", "stock" ] }
Example 2 shows how the construct is included in a MUSTx section.
{ "MUSTANY": { "ZONES": [ "Subject", "Body", "File" ], "TERMS": [ "hello world", "guarantee*", "stock" ] } }
Example 3 shows how the ZONES subfield is included in a MUSTx subfield of the WORDSANDPHRASES section.
{ "WORDSANDPHRASES": { "MUSTANY": { "ZONES": [ "Subject", "Body", "File" ], "TERMS": [ "hello world", "guarantee*", "stock" ] } } }
Example 4 shows how multiple MUSTx sections can be combined. Note that ZONES has been left out of the MUSTALL section. This means that all parts of a communication will be searched for the MUSTALL Entries.
{ "WORDSANDPHRASES": { "MUSTANY": { "ZONES": [ "Subject", "Body", "File" ], "TERMS": [ "hello world", "guarantee*", "stock" ] }, "MUSTALL": { "TERMS": [ "quick brown fox", "i assure you that" ] } } }
MINMATCH
MINMATCH can be used as a standalone operator or in conjunction with a Prequalifier. The purpose of the MINMATCH operator is to require more than one hit against entries when a single match would be too broad and would result in a significant number of false positives.
Example
{
  "WORDSANDPHRASES": {
    "MUSTANY": {
      "ZONES": [ "Subject", "Body", "File" ],
      "LEXICONS_LIST": [ "Name_of_Lexicon_List" ],
      "MINMATCH": 2
    }
  }
}
Caution
The MINMATCH score should never be artificially inflated just to reduce false positives. If a policy has a high false positive rate, the best-practices way to proceed is first to see if indexing company disclaimers will help reduce the numbers, followed by tuning entries to add or delete terms, particularly by using the EXCLUDING operator. Policies like Information Security should rarely, if ever, use MINMATCH; a single hit should be sufficient to trigger those policies. Indexing disclaimers and adding specific exclusions will help reduce false positives by an extremely significant amount.
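The threshold behavior of MINMATCH can be pictured with the following Python sketch. It assumes that MINMATCH counts distinct entries found in the communication and that a case-insensitive substring test stands in for the real matcher. Neither assumption is confirmed by this document, and meets_minmatch is a hypothetical name.

```python
def meets_minmatch(entries, text, minmatch=1):
    """True if at least `minmatch` distinct entries occur in text.

    Assumption: each entry counts once, however often it appears.
    """
    lowered = text.lower()
    hits = {entry.lower() for entry in entries if entry.lower() in lowered}
    return len(hits) >= minmatch
```

With entries "gift", "tickets", and "dinner" and a MINMATCH of 2, a message mentioning only a gift does not qualify, while one mentioning a gift and tickets does.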
Using Boolean Operators
MUSTx sections can also be used in combination with Boolean operators such as AND and OR.
Two or more sections can be logically combined using AND and/or OR Boolean operators. A combination of sections joined by a Boolean operator itself behaves as a section, which enables nested Boolean operations on sections.
Boolean Logic Connectors
The following sets of examples illustrate how to use Boolean connectors.
Example 1 requires the presence of all three sections.
Example 2 requires the presence of one or more of the three sections.
Example 3 requires the presence of the first three sections and either the fourth or fifth section.
Example 1 |
Example 2 |
Example 3 |
{ |
{ |
{ |
The following are examples of using Boolean operations on MUSTx sections:
Example 1 requires the communication to match at least one entry on Lexicon 1 and match at least one entry on Lexicon 2.
{ "WORDSANDPHRASES": { "MUSTANY": { "AND": [ { "LEXICONS_LIST": [ "Lexicon_1" ] }, { "LEXICONS_LIST": [ "Lexicon_2" ] } ] } } }
Example 2 requires the communication to match all entries in Lexicon 1 and all entries in Lexicon 2.
{ "WORDSANDPHRASES": { "MUSTALL": { "AND": [ { "LEXICONS_LIST": [ "Lexicon_1" ] }, { "LEXICONS_LIST": [ "Lexicon_2" ] } ] } } }
Example 3 looks at the subject and body of the communication to see if any of the embedded Terms are present and to see if the attachment contains any Entry in Lexicon 1 and/or Lexicon 2.
{
  "WORDSANDPHRASES": {
    "MUSTANY": {
      "AND": [
        { "ZONES": [ "Subject", "Body" ], "TERMS": [ "hello world", "guarantee", "stock" ] },
        { "ZONES": [ "File" ], "LEXICONS_LIST": [ "Lexicon_1", "Lexicon_2" ] }
      ]
    }
  }
}
Example 4 uses two different Boolean operators. The policy looks for a match of an Entry on Lexicon 1 and an Entry on Lexicon 2, and either (a) at least two entries on Lexicon 3 or (b) at least five entries on Lexicon 4. See MINMATCH for a discussion of how to require the presence of more than one entry before a hit will occur.
{
  "WORDSANDPHRASES": {
    "MUSTANY": {
      "AND": [
        { "LEXICONS_LIST": [ "name_of_lexicon_1" ] },
        { "LEXICONS_LIST": [ "name_of_lexicon_2" ] },
        { "OR": [
          { "LEXICONS_LIST": [ "name_of_lexicon_3" ], "MINMATCH": 2 },
          { "LEXICONS_LIST": [ "name_of_lexicon_4" ], "MINMATCH": 5 }
        ] }
      ]
    }
  }
}
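The nesting of AND and OR inside a MUSTANY section can be read as a small recursive evaluation. The following Python sketch shows one way to model it. It assumes MUSTANY semantics for each leaf (at least MINMATCH distinct entries present, defaulting to 1), a case-insensitive substring test in place of the real matcher, and lexicon contents supplied as a local dictionary. The function name evaluate and the sample lexicon contents are hypothetical.

```python
def evaluate(node, lexicons, text):
    """Evaluate a MUSTANY-style node that may contain nested AND / OR lists."""
    if "AND" in node:
        return all(evaluate(child, lexicons, text) for child in node["AND"])
    if "OR" in node:
        return any(evaluate(child, lexicons, text) for child in node["OR"])
    # Leaf: gather embedded TERMS and the entries of referenced lexicons.
    entries = list(node.get("TERMS", []))
    for name in node.get("LEXICONS_LIST", []):
        entries.extend(lexicons[name])
    lowered = text.lower()
    hits = sum(1 for entry in set(entries) if entry.lower() in lowered)
    return hits >= node.get("MINMATCH", 1)

# Usage: a structure shaped like Example 4, with made-up lexicon contents.
lexicons = {
    "name_of_lexicon_1": ["gift"],
    "name_of_lexicon_2": ["client"],
    "name_of_lexicon_3": ["dinner", "tickets", "golf"],
    "name_of_lexicon_4": ["alpha1", "bravo2", "charlie3", "delta4", "echo5"],
}
must_any = {"AND": [
    {"LEXICONS_LIST": ["name_of_lexicon_1"]},
    {"LEXICONS_LIST": ["name_of_lexicon_2"]},
    {"OR": [
        {"LEXICONS_LIST": ["name_of_lexicon_3"], "MINMATCH": 2},
        {"LEXICONS_LIST": ["name_of_lexicon_4"], "MINMATCH": 5},
    ]},
]}
```

Because a Boolean combination is evaluated by the same function as a leaf, an OR can sit inside an AND (or the reverse) to any depth.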