WORDSANDPHRASES

Introduction

WORDSANDPHRASES consists of sections that determine whether particular content must or must not exist before a policy will trigger an alert. Those sections are MUSTANY, MUSTALL, and MUSTNOT, referred to collectively as MUSTx sections. MUSTx sections can be used individually or in conjunction with each other. Each MUSTx section contains a ZONES field.

ZONES fields are in turn comprised of TERMS and LEXICONS_LIST subfields that contain the criteria used to analyze the content of communications. Content is analyzed by matching content in the communication against content in ENTRIES present in the subfields of the ZONES field.

For the TERMS subfields, ENTRIES are embedded directly into the JSON policy itself; for the LEXICONS_LIST subfield, a reference embedded in the JSON policy points to an external text file that contains the Entries. Multiple lexicons can be used.

If content does not need to be analyzed, the WORDSANDPHRASES section is not included in a policy; instead, the FILTERS section is used to analyze contextual components of a communication. WORDSANDPHRASES can be used in conjunction with FILTERS.

ENTRIES

Entries are individual lines found in the subfields of ZONES and/or in a lexicon referenced by a LEXICONS_LIST operator. Each Entry consists of one or more Terms, each of which can take the form of a Simple Term or a Complex Term.

TERMS

Terms are strings of characters that can be either Simple or Complex. Entries are comprised of Terms.

Note

Searches for Terms are always case-insensitive. For regular expressions, no analysis other than case conversion is applied.

Simple Terms

Searches using Entries comprised of a list of words and phrases are referred to as simple keyword searches.

Example

Term	Results
gift	Searches will only match for the word gift throughout the document.
gift for you	Searches will match throughout the communication or its attachments only on the exact phrase gift for you.

Complex Terms

Complex Terms go beyond simple lists of words and phrases, using a variety of contextual syntax characters and operators to provide more precise matching capabilities.

Complex Terms	Example	Result
Wildcards
Search for multiple variations of a keyword using wildcard symbols. Note: A wildcard must be preceded by at least three characters at the beginning of the term in which it is used.	`gift*`	Searches will match gift, gifts, gifted, gifting, or any word beginning with "gift" throughout the document.
	`str?p`	Searches will match only five-letter words that start with "str" and end with "p", such as strap, strep, strip, and strop.
Regular Expressions
Search for a sequence of symbols or characters expressing a string or a pattern (regular expressions) such as Social Security Numbers, credit card numbers, phone numbers, dates, IP addresses, and so on. Note: A search term containing a blank space will be parsed as one term.	`/^(?!000|666)[0-9]{3}([ -]?)(?!00)[0-9]{2}\1(?!0000)[0-9]{4}$/`	Searches look for Social Security Numbers in the format `###-##-####` or `### ## ####`. The hash signs stand for digits, where the first three digits are not 000 or 666, the middle two digits are not 00, and the last four digits are not 0000. Searches will match numbers such as 144-94-8875 or 882-47-3337.
	`Thank? for my /gift.|present./`	Searches will match phrases such as: "Thanks for my gift"; "Thanks for my gifts"; "Thanks for my present"; "Thanks for my presents". It will NOT match "Thank you for my gifts" because "you" is not accounted for.
Fuzzy Logic
Fuzzy search helps you find words that closely resemble the search term within a document. Note: A search term containing a blank space will be parsed as one term. The tilde applies edit-distance matching, that is, the minimum number of edits needed to transform one string into another, with the allowable edit operations being insertion, deletion, or substitution of a single character. Use of the tilde is ineffective with two or fewer characters. The default maximum number of edits allowed, both for terms of three to five characters and for terms of more than five characters, is 2. For example, "substitution~" will match on "substitutoin" because the "o" is deleted in "substitutoin" and then reinserted before the "n", that is, two edits. Although the number of edits allowed can be configured, it is recommended that the default value be used.	`free lunch~`	Searches will match on phrases such as: "meet me for a free lunch"; "meet me for a free luch" (note the misspelling); "free lunches"; "join the free launch party for our new boat".
Proximity Operators
Proximity operators are special syntax words that combine simple and/or complex terms to allow for greater accuracy of content analysis by looking for the terms within a specific distance of each other. The operator word is followed by a comma and a number, where the number represents the maximum number of offsets (words) that can exist between the terms. In addition to the description of the operators below, also see Limitations in a Proximity Search.
FOLLOWEDBY Signifies left-to-right directionality where the first Term must be followed by the second Term within the specified distance. `Term1 FOLLOWEDBY,4 Term2` will match on the content of a communication so long as the first Term does not have more than four offsets between it and the second Term. In other words, there can be 0-4 words between the two Terms.	`guarantee FOLLOWEDBY,4 return*`	Examples that will hit: "I guarantee you will see wild returns"; "They guarantee that you can return the item within 10 days". Example that will not hit: "We guarantee the stock will absolutely show you 10x returns".
	`thank* FOLLOWEDBY,3 gift`	Searches will match on phrases such as: "Thanks for the gift."; "Thankful for my gift."; "Thank you for my gift."; "Thankyou for my gift." (note: thankyou is misspelled here as a single word; it matches because of the wildcard). Because the proximity distance is limited to three words, searches will NOT match on phrases such as: "Thank you for my birthday gift."; "Thankful for my really lovely gift." Increasing the value of the number following "FOLLOWEDBY" will allow such phrases to be captured. Note: "thank FOLLOWEDBY,3 gift" will produce results identical to "gift PRECEDEDBY,3 thank".
NOTFOLLOWEDBY Signifies left-to-right directionality where the first Term must not be followed by the second Term within the specified distance.	`thank* NOTFOLLOWEDBY,3 gift`	Searches will match on words and phrases such as: "Thanks"; "Thank you"; "I want to say thanks again for lunch."; "I'm really thankful for a friend like you." Searches will not match if "thank" is followed within zero to three words by "gift". Consequently, "Thank you for the gift" will NOT match, but "Thank you for the very thoughtful gift" WILL match. Note: "thank NOTFOLLOWEDBY,3 gift" will produce results identical to "gift NOTPRECEDEDBY,3 thank".
PRECEDEDBY Signifies right-to-left directionality where the first Term must be preceded by the second Term within the specified distance.	`guarantee* PRECEDEDBY,6 return~`	Searches will match on phrases such as: "A high return on your investment is guaranteed."; "Returns are covered by a money-back guarantee." Note the use of the tilde here for fuzzy matching. Because of the length of the root word, up to two edits can be made to constitute a match; therefore, the word "guarantee" will match so long as no more than six words are between it and any of the following: return, returns, returned, riturn, retrun, etc. It will NOT match on "returning" because that would require 3 edits. To match on "returning", the entry would have to be written as either `guarantee* PRECEDEDBY,6 return*` or `guarantee* PRECEDEDBY,6 return~ | returning` (see Multiple Terms below for an explanation of the vertical pipe character).
NOTPRECEDEDBY Signifies right-to-left directionality where the first Term must not be preceded by the second Term within the specified distance.	`golf outing NOTPRECEDEDBY,2 company sponsored`	Searches will match on phrases such as: "The cost for the golf outing is on me."; "Let me take you and Bob for a golf outing." Searches will not match on: "Let me take you and Bob to our company sponsored golf outing."
NEAR Signifies bi-directionality where the first Term must appear either before or after the second Term within the specified distance.	`guarantee* NEAR,3 return*`	Searches will match on: "Returns are absolutely guaranteed."; "I guarantee you'll see great returns."
NOTNEAR Signifies bi-directionality where the first Term must not appear either before or after the second Term within the specified distance.	`guarantee* NOTNEAR,3 return*`	Searches will match on phrases such as: "I guarantee you will have a great time at the outing."; "Bob guarantees that the documents will be sent on time."; "You have my guarantee." Searches will not match on phrases such as: "Bob guaranteed to return the file when he's done."; "Returns on investments are guaranteed."
Multiple Terms can be used on either or both sides of a syntax operator and are separated by a vertical pipe character "`|`", which is shorthand for a logical OR. Note: There must be a space before and after the pipe. Example: `thank* | amazing FOLLOWEDBY,5 gift | present` The literal translation is: "thank*" or "amazing" followed by "gift" or "present" so long as there are no more than five words between the former and latter. Searches will match the following phrases: "Thank you for the wonderful gift"; "Don't forget to thank Mom for the presents"; "Thanks for today's great sales presentation"; "Thankyou for such an expensive gift" (note: thankyou is misspelled here as a single word; it matches because of the wildcard); "What an amazing and thoughtful gift". Proximity operators can be nested; however, parentheses must be used to group each level to avoid confusion. Operators are processed linearly left to right. Example: `((brown fox FOLLOWEDBY,2 jumps) NOTFOLLOWEDBY,3 lazy horse) NOTPRECEDEDBY,4 slow` The above example first checks whether the phrase "brown fox" is followed within two words by the word "jumps". It next makes sure that those words are not followed by "lazy horse", then ensures that they are also not preceded by the word "slow". The above example will trigger on: "Quick brown fox jumps over the lazy dog"; "Fast brown fox always jumps over the horse". The above example will NOT trigger on: "Slow brown fox jumps over the lazy dog"; "Quick brown fox jumps over the lazy horse".
Excluding
The EXCLUDING operator provides an exception to a term that will prevent an alert from being triggered. Note: Because the pipe character `|` is also used to separate alternative terms in a proximity search, whenever a proximity term with alternative terms is used in the EXCLUDING clause, parentheses should be used to make the meaning of the pipe character explicit.	`out of the money EXCLUDING out of the money jar | finished out of the money`	Searches will match on phrases such as: "I want you to take my holdings out of the money market fund." Searches will not match on phrases such as: "The horse finished out of the money"; "Take a dollar out of the money jar."
Lucene Syntax
Lucene is a search engine library that is written in Java. Lucene code differs from that used by ElasticSearch. Lucene syntax cannot be combined with ElasticSearch syntax in a JSON file or a lexicon. Smarsh recommends that the use of Lucene syntax be restricted solely to policies created in the Advanced query section of the UI.
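Two of the matching rules in the table above, the edit distance behind the tilde and the word distance behind FOLLOWEDBY, can be illustrated with a small Python sketch. This is not the product's implementation. The function names `edit_distance` and `followed_by` are hypothetical, the tokenizer is a simplification, and `followed_by` treats both terms as if they ended in a `*` wildcard.

```python
import re


def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b (Levenshtein distance)."""
    a, b = a.lower(), b.lower()  # Term searches are case-insensitive
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def followed_by(text: str, first: str, second: str, distance: int) -> bool:
    """True if a word starting with `first` is followed by a word starting
    with `second` with at most `distance` words between them.
    Assumption: both terms are prefix-matched, as if written with `*`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    for i, word in enumerate(words):
        if not word.startswith(first):
            continue
        for j in range(i + 1, len(words)):
            if words[j].startswith(second) and j - i - 1 <= distance:
                return True
    return False


print(edit_distance("substitution", "substitutoin"))  # 2
print(edit_distance("return", "returning"))           # 3
print(followed_by("I guarantee you will see wild returns",
                  "guarantee", "return", 4))           # True
```

With a maximum of two edits, "substitutoin" is within range of "substitution", while "returning" is three edits away from "return" and so would need a wildcard or an explicit alternative.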

Use of TERMS

The following is an example of how different Terms can be embedded in a policy. For a further explanation of the MUSTANY and ZONES operators in the example, see the MUSTx and ZONES sections respectively.

{
	"WORDSANDPHRASES": {
		"MUSTANY": {
			"ZONES": [
				"Subject",
				"Body",
				"File"
			],
			"TERMS": [
				"strap",
				"str?p",
				"stock~",
				"stock guarantee",
				"stock guarant*",
				"stock FOLLOWEDBY,2 guarant*",
				"guarant* PRECEDEDBY,2 stock",
				"(stock FOLLOWEDBY,2 guarantee) PRECEDEDBY,2 i | we"
			]
		}
	}
}
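Because every Entry in TERMS must be a quoted string and a comma must follow all but the last, it can help to check a policy file for JSON syntax errors before uploading it. The following is a minimal sketch using Python's standard `json` module. The function name `check_policy_syntax` is hypothetical, and the check covers JSON syntax only, not any validation the product itself performs.

```python
import json


def check_policy_syntax(policy_text: str):
    """Return None if policy_text parses as JSON, else a short error description."""
    try:
        json.loads(policy_text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


# A well-formed policy fragment.
good = '{"WORDSANDPHRASES": {"MUSTANY": {"TERMS": ["strap", "str?p"]}}}'
# The same fragment with the comma between the two Entries missing.
bad = '{"WORDSANDPHRASES": {"MUSTANY": {"TERMS": ["strap" "str?p"]}}}'

print(check_policy_syntax(good))  # None
print(check_policy_syntax(bad))   # a message locating the missing comma
```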

LEXICONS_LIST

If the list of Entries in the TERMS subfield is extensive, one can use a LEXICONS_LIST in lieu of listing each Entry individually.

When the LEXICONS_LIST operator is used, the name of an external lexicon, rather than the Entries themselves, is embedded in the JSON. The lexicon is simply a text file comprised of a series of Entries, each listed as it would otherwise appear in the TERMS field if the Entries were embedded in the JSON policy. In technical terms, the names of the lexicons are provided as an array to the LEXICONS_LIST subfield of a MUSTx section.

For a further explanation of the MUSTANY operator in the following example, see MUSTx Sections. Lexicons can be referenced by more than one policy.

IMPORTANT

The name of the lexicon file and its reference in the JSON file must be identical. Case, spaces, and underscores are all matched literally. Because it is easy to mistakenly use two spaces between words instead of one, best practices dictate using an underscore to separate the words in the lexicon's name, for example, Lexicon_Gifts_and_Entertainment. To distinguish between policies and lexicons, it is suggested that all lexicon names contain a Lexicon prefix as in the example.

LEXICONS_LIST Prequalifiers

Occasionally there is a need to use more than one LEXICONS_LIST in a single policy. The most common scenario in which more than one list is used is to create a "prequalifier" that looks to see if an entry on the first list is present before it moves to the next list to check for a match. If there is no match on the first list, processing ceases and the system moves to the next communication. Common policies that use prequalifiers are Gifts & Entertainment and Fair & Balanced communications. Prequalifiers help reduce false positive "noise" and improve policy accuracy. MINMATCH can also be used to help reduce the number of hits on a policy, but there are important caveats involved in doing so.

Note

You will not be able to upload a policy containing a LEXICONS_LIST operator unless you have previously uploaded the referenced lexicon into the List Library of the system. An error message will appear in the UI indicating the name of the missing lexicon. See Configuring List Library for more details on how to upload a lexicon to the List Library.

IMPORTANT

If someone edits a policy after it has been uploaded to change the name of a LEXICONS_LIST, and the name of the list no longer matches the name of the lexicon actually uploaded in the List Library, no error message will be displayed to indicate that a mismatch exists; however, the policy will not run and the entire queue in which it is placed could be disabled.

When making any change to the name of a LEXICONS_LIST, confirm the following points:

The name of the LEXICONS_LIST is changed;
Every other policy that refers to that lexicon has its internal reference changed as well.

Example - Using Lexicon List

{
	"WORDSANDPHRASES": {
		"MUSTANY": {
			"ZONES": [
				"Subject",
				"Body",
				"File"
			],
			"LEXICONS_LIST": [
				"Name_of_Lexicon_List"
			]
		}
	}
}
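Since a renamed or misspelled lexicon reference produces no error message once a policy has been uploaded, it may be worth comparing the names a policy references against the names actually uploaded. The sketch below is one way to do that offline. The helper names and the `uploaded` set are hypothetical, and the list of uploaded names would have to be collected by hand from the List Library.

```python
def referenced_lexicons(node):
    """Recursively collect every name listed under a LEXICONS_LIST key."""
    names = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "LEXICONS_LIST":
                names.extend(value)
            else:
                names.extend(referenced_lexicons(value))
    elif isinstance(node, list):
        for item in node:
            names.extend(referenced_lexicons(item))
    return names


def missing_lexicons(policy, uploaded_names):
    """Referenced names with no exact match (case, spaces, underscores)."""
    uploaded = set(uploaded_names)
    return [name for name in referenced_lexicons(policy) if name not in uploaded]


policy = {
    "WORDSANDPHRASES": {
        "MUSTANY": {
            "ZONES": ["Subject", "Body", "File"],
            "LEXICONS_LIST": ["Lexicon_Gifts_and_Entertainment"],
        }
    }
}
# Hypothetical uploaded name that differs only in the case of "And".
uploaded = {"Lexicon_Gifts_And_Entertainment"}

print(missing_lexicons(policy, uploaded))  # ['Lexicon_Gifts_and_Entertainment']
```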

MUSTx Sections

The MUSTx criteria determine whether or not the Terms of an Entry must or must not appear in a communication, as well as with what frequency. The WORDSANDPHRASES section must contain at least one of the following MUSTx occurrence criteria:

MUSTANY	This criterion implies that AT LEAST ONE of the entries in the WORDSANDPHRASES subsections must be present in an analyzed communication. MINMATCH can be used to increase the minimum number of entries that must be present before a policy will trigger.
MUSTALL	This criterion implies that ALL of the entries inside a WORDSANDPHRASES subsection must be present in any communication. If even one of the MUSTALL entries is missing, the policy will not trigger.
MUSTNOT	This criterion implies that NONE of the entries inside a subsection may be present in a communication. Important: If any MUSTNOT entry is matched, the entire communication will be excluded from analysis unless a MUSTHIT policy is in use in the queue in which the policy is being used.
MUSTHIT	The MUSTHIT operator is used in a standalone ignore policy in which only the MUSTHIT entries (via the direct embedding of terms or via the use of a LEXICONS_LIST) are present. MUSTHIT policies are included in the same queue as flagging policies, but are executed first. If a match on the MUSTHIT policy is found, the matching communication is prevented from being excluded because it is set aside and no exclusionary operators can be applied to it.

If multiple MUSTx sections are used, they are considered to be ANDed. So if a MUSTANY and MUSTNOT are present, both conditions must be satisfied before the policy will trigger.

Note

There can be only one of each type of MUSTx section in a policy. That is, although you can have all three of the MUSTANY, MUSTALL, and MUSTNOT sections, you cannot use a Boolean connector to combine two MUSTANY sections. See Using Boolean Operators for valid examples of the use of Boolean connectors.

MUSTx Criteria Rules

There is no maximum number of Entries that can be present in MUSTANY criteria.
A maximum of 50 entries can be present in MUSTALL criteria.
A maximum of 50 entries can be present in MUSTNOT criteria.
The maximum limit of 50 entries in each of the MUSTALL and MUSTNOT criteria is cumulative. Although users can combine terms lists and lexicons, the combined total of entries cannot exceed 50.
An error message is displayed in Enterprise Archive if the list of terms exceeds 50 for the MUSTALL and/or MUSTNOT criteria.
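The cumulative limit can be checked before upload with a short script. The following is a rough sketch under stated assumptions: the helper names are hypothetical, `lexicon_entries` is a hand-built mapping from each lexicon name to its list of Entries, and only TERMS and LEXICONS_LIST placed directly inside a MUSTx section are counted (nested Boolean connectors are ignored).

```python
MAX_ENTRIES = 50  # limit stated above for MUSTALL and MUSTNOT


def count_entries(section, lexicon_entries):
    """Embedded Terms plus the Entries of every referenced lexicon."""
    total = len(section.get("TERMS", []))
    for name in section.get("LEXICONS_LIST", []):
        total += len(lexicon_entries[name])
    return total


def over_limit(words_and_phrases, lexicon_entries):
    """Names of MUSTALL/MUSTNOT sections whose combined entries exceed 50."""
    return [
        name
        for name in ("MUSTALL", "MUSTNOT")
        if name in words_and_phrases
        and count_entries(words_and_phrases[name], lexicon_entries) > MAX_ENTRIES
    ]


# Hypothetical lexicon with 45 Entries.
lexicon_entries = {"Lexicon_Exclusions": [f"entry {i}" for i in range(45)]}
words_and_phrases = {
    "MUSTANY": {"TERMS": [f"term {i}" for i in range(200)]},  # no limit applies
    "MUSTNOT": {
        "TERMS": [f"term {i}" for i in range(10)],
        "LEXICONS_LIST": ["Lexicon_Exclusions"],
    },
}

print(over_limit(words_and_phrases, lexicon_entries))  # ['MUSTNOT'] (10 + 45 = 55)
```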

Usage of MUSTx Selections

The following are a few examples of how the MUSTx values for WORDSANDPHRASES might be constructed. The open-close braces { } in the representations below are placeholders for the fields and subfields of the WORDSANDPHRASES ZONES that contain the information that define search criteria. The number of open braces must match the number of closed ones.

Note

Syntax Representations 4 and 5 below illustrate how to combine WORDSANDPHRASES with FILTERS.

Further note how the two sections are separated by a comma. When a policy is downloaded directly from an environment in which it has been loaded, the FILTERS section appears before the WORDSANDPHRASES section, even if the file had originally been uploaded with the FILTERS and WORDSANDPHRASES sections in the reverse order.

Syntax Representation 1

{
	"WORDSANDPHRASES": {
		"MUSTALL": {},
		"MUSTNOT": {},
		"MUSTANY": {}
	}
}

Syntax Representation 2

{
	"WORDSANDPHRASES": {
		"MUSTNOT": {},
		"MUSTANY": {}
	}
}

Syntax Representation 3

{
	"WORDSANDPHRASES": {
		"MUSTALL": {},
		"MUSTANY": {}
	}
}

Syntax Representation 4

{
	"FILTERS": {},
	"WORDSANDPHRASES": {
		"MUSTANY": {}
	}
}

Syntax Representation 5

{
	"WORDSANDPHRASES": {
		"MUSTANY": {}
	},
	"FILTERS": {}
}

ZONES Section

The ZONES field of the WORDSANDPHRASES section specifies to what parts of a communication a search should apply. The absence of a ZONES field implies that the complete set of ZONES will be searched.

The following are the ZONES in which a search is eligible to take place. Each ZONE defines a region of the communication that is a target for matching terms.

Subject – Performs a search for Entries in the communication's Subject.
Body – Performs a search for Entries in the communication's Body.
Policy – Performs a search of Policy Events and Attributes applied to communications.
- Policy Events – Actions taken on Socialite before ingesting the communications into Enterprise Archive.
- Attributes – Communications tagged with Custom Attributes.
Action – Performs a search of Action Events and Attributes applied to communications.
- Action Events – Actions taken on Vantage before ingesting the communications into Enterprise Archive.
- Attributes – Communications tagged with Custom Attributes.
File – Performs a search for Entries in the file attachments within the communication.
System – Performs a search for Entries in the System Generated messages within the communication.

A MUSTx ZONE criterion contains search targets such as attributes of a communication. For example, a policy might search only the subject of a communication, or perhaps the body and attachment.

USE OF ZONES

Note

When using simple or complex terms in ZONES, each term entry must be enclosed in quotation marks and a comma must follow all but the last entry.

When using multiple MUSTx sections, there is an implied Boolean AND between them. In other words, the criteria of all sections must be satisfied for the policy to be triggered.

ZONES also can be combined with a Boolean AND or OR. See the section below on Boolean operators.

Reference

The colors below are placed for ease of following which sections belong together. The indents on braces and brackets are likewise meant to help understand where sections begin and end. In fact, no carriage returns are necessary when composing a JSON file.

The following sets of examples illustrate how to use ZONES with TERMS and LEXICONS_LIST.

Example 1 shows how the ZONES field with a Terms subfield is constructed.

Note

This example will not work on its own because the WORDSANDPHRASES field is required.

{
	"ZONES": ["Subject", "Body", "File"],
	"TERMS": [
		"hello world",
		"guarantee*",
		"stock"
	]
}

Example 2 shows how the construct is included in a MUSTx section.

Note

This example will not work on its own because the WORDSANDPHRASES field is required.

{
	"MUSTANY": {
		"ZONES": ["Subject", "Body", "File"],
		"TERMS": [
			"hello world",
			"guarantee*",
			"stock"
		]
	}
}

Example 3 shows how the ZONES subfield is included in a MUSTx subfield of the WORDSANDPHRASES section.

{
	"WORDSANDPHRASES": {
		"MUSTANY": {
			"ZONES": ["Subject", "Body", "File"],
			"TERMS": [
				"hello world",
				"guarantee*",
				"stock"
			]
		}
	}
}

Example 4 shows how multiple MUSTx sections can be combined. Note that ZONES has been left out of the MUSTALL section. This means that all parts of a communication will be searched for the MUSTALL Entries.

{
	"WORDSANDPHRASES": {
		"MUSTANY": {
			"ZONES": ["Subject", "Body", "File"],
			"TERMS": [
				"hello world",
				"guarantee*",
				"stock"
			]
		},
		"MUSTALL": {
			"TERMS": [
				"quick brown fox",
				"i assure you that"
			]
		}
	}
}

MINMATCH

MINMATCH can be used as a standalone operator or in conjunction with a Prequalifier. The purpose of the MINMATCH operator is to require more than one hit against entries when a single match would be too broad and would result in a significant number of false positives.

Note

The value following MINMATCH is not enclosed by brackets, braces, or quotation marks. The value determines how many entries must match for a hit to occur. The default MINMATCH value is 1, so the MINMATCH operator should be omitted unless a value greater than one is desired. In the example below, the value is 2, so two or more Entries in the referenced lexicon must be present in the communication.

Example

{
	"WORDSANDPHRASES": {
		"MUSTANY": {
			"ZONES": ["Subject", "Body", "File"],
			"LEXICONS_LIST": [
				"Name_of_Lexicon_List"
			],
			"MINMATCH": 2
		}
	}
}
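The effect of MINMATCH on a MUSTANY section can be pictured with a simplified sketch. It assumes that MINMATCH counts distinct Entries that are present, as the description above suggests, and it handles Simple Terms only. The function names and the sample Entries are hypothetical and do not reflect how the product evaluates policies internally.

```python
import re


def matched_entries(text, entries):
    """Entries (simple words or phrases) found in text, ignoring case."""
    return [
        entry
        for entry in entries
        if re.search(r"\b" + re.escape(entry) + r"\b", text, re.IGNORECASE)
    ]


def must_any(text, entries, minmatch=1):
    """True when at least `minmatch` distinct entries are present."""
    return len(matched_entries(text, entries)) >= minmatch


entries = ["gift", "tickets", "dinner"]  # stand-in for a lexicon's Entries

print(must_any("Thanks for the gift", entries))                                 # True
print(must_any("Thanks for the gift", entries, minmatch=2))                     # False
print(must_any("The gift included tickets to the game", entries, minmatch=2))   # True
```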

Caution

The MINMATCH score should never be artificially inflated just to reduce false positives. If a policy has a high false positive rate, the best-practices way to proceed is first to see if indexing company disclaimers will help reduce the numbers, followed by tuning entries to add or delete terms, particularly by using the EXCLUDING operator. Policies like Information Security should rarely, if ever, use MINMATCH; a single hit should be sufficient to trigger those policies. Indexing disclaimers and adding specific exclusions will help reduce false positives by an extremely significant amount.

Using Boolean Operators

MUSTx sections can also be used in combination with Boolean operators such as AND and OR.

Two or more sections can be logically combined using AND and/or OR Boolean operators. A combination of sections joined by a Boolean operator behaves as another section, which enables nested Boolean operations on sections.

Boolean Logic Connectors

The following sets of examples illustrate how to use Boolean connectors.

Example 1 requires the presence of all three sections.

Example 2 requires the presence of one or more of the three sections.

Example 3 requires the presence of the first three sections and either the fourth or fifth section.

Example 1	Example 2	Example 3
{ "AND" : [ <section1>, <section2>, <section3> ] }	{ "OR" : [ <section1>, <section2>, <section3> ] }	{ "AND" : [ <section1>, <section2>, <section3>, { "OR" : [ <section4>, <section5> ] } ] }
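The way nested connectors combine can be traced with a small evaluator. This is only an illustration of the Boolean logic. The `evaluate` function is hypothetical, and each section is reduced to a name whose true or false result is supplied in advance.

```python
def evaluate(node, section_results):
    """Evaluate a section name, or a one-key {"AND": [...]} / {"OR": [...]} node."""
    if isinstance(node, str):
        return section_results[node]
    (operator, children), = node.items()  # exactly one connector per node
    results = [evaluate(child, section_results) for child in children]
    if operator == "AND":
        return all(results)
    if operator == "OR":
        return any(results)
    raise ValueError(f"unknown connector: {operator}")


# Example 3 from the table above.
example_3 = {
    "AND": [
        "section1",
        "section2",
        "section3",
        {"OR": ["section4", "section5"]},
    ]
}

results = {
    "section1": True,
    "section2": True,
    "section3": True,
    "section4": False,
    "section5": True,
}
print(evaluate(example_3, results))  # True: sections 1-3 match, and section 5 satisfies the OR

results["section5"] = False
print(evaluate(example_3, results))  # False: neither section 4 nor section 5 matches
```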

The following are examples of using Boolean operations on MUSTx sections:

Note

In a Lexicon policy, the MUSTx is applied to the sections first and then Boolean logic is considered.

Reference

The colors below are placed for ease of following which sections belong together. The indents on braces and brackets are likewise meant to help understand where sections begin and end. In fact, no carriage returns are necessary when composing a JSON file.

Example 1 requires the communication to match at least one entry on Lexicon 1 and match at least one entry on Lexicon 2.

{
	"WORDSANDPHRASES": {
		"MUSTANY": {
			"AND": [
				{
					"LEXICONS_LIST": [
						"Lexicon_1"
					]
				},
				{
					"LEXICONS_LIST": [
						"Lexicon_2"
					]
				}
			]
		}
	}
}

Example 2 requires the communication to match all entries in Lexicon 1 and all entries in Lexicon 2.

{
	"WORDSANDPHRASES": {
		"MUSTALL": {
			"AND": [
				{
					"LEXICONS_LIST": [
						"Lexicon_1"
					]
				},
				{
					"LEXICONS_LIST": [
						"Lexicon_2"
					]
				}
			]
		}
	}
}

Example 3 looks at the subject and body of the communication to see if any of the embedded Terms are present and to see if the attachment contains any Entry in Lexicon 1 and/or Lexicon 2.

{
	"WORDSANDPHRASES": {
		"MUSTANY": {
			"AND": [
				{
					"ZONES": ["Subject", "Body"],
					"TERMS": [
						"hello world",
						"guarantee",
						"stock"
					]
				},
				{
					"ZONES": ["File"],
					"LEXICONS_LIST": [
						"Lexicon_1",
						"Lexicon_2"
					]
				}
			]
		}
	}
}

Example 4 uses two different Boolean operators. The policy looks for a match of an Entry on Lexicon 1 and an Entry on Lexicon 2, and either (a) at least two entries on Lexicon 3 or (b) at least five entries on Lexicon 4. See MINMATCH for a discussion of how to require the presence of more than one entry before a hit will occur.

{
	"WORDSANDPHRASES": {
		"MUSTANY": {
			"AND": [
				{
					"LEXICONS_LIST": [
						"name_of_lexicon_1"
					]
				},
				{
					"LEXICONS_LIST": [
						"name_of_lexicon_2"
					]
				},
				{
					"OR": [
						{
							"LEXICONS_LIST": [
								"name_of_lexicon_3"
							],
							"MINMATCH": 2
						},
						{
							"LEXICONS_LIST": [
								"name_of_lexicon_4"
							],
							"MINMATCH": 5
						}
					]
				}
			]
		}
	}
}