[Note: This topic is slated for removal. Refer also to the NLP++ section.]
NLP++™ is a novel programming language developed by Text Analysis International, Inc. It enables a developer to focus on the conceptual task of building a text analyzer while minimizing the need to delve into compiled programming languages. NLP++ combines a C++-like syntax (called the code part) with a rule syntax (described in the Pass Files topic). The current topic is concerned primarily with the code part.
One use of NLP++ is dynamic semantics, that is, to create and use information "on the fly". Another is to embed heuristics and algorithms into the text analyzer under construction.
The NLP++ language can currently appear in the @CHECK, @CODE and @POST regions of a pass file. In a @CHECK region, the functions succeed and fail can be used to accept or reject a rule match, respectively.
The @PRE region, which operates during the pattern matching of a rule, does not yet allow NLP++ constructs.
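As a minimal sketch of a @CHECK region, the pass below accepts a single-word match only when the word is capitalized and rejects it otherwise. The region layout mirrors the pass file sample at the end of this topic. Placing @CHECK ahead of @POST, the "output.txt" file name, and the choice of $cap as the test are illustrative assumptions, not a prescribed pattern.

```
@NODES _ROOT
@CHECK
# Reject the rule match unless the matched word is capitalized.
if (!N("$cap", 1))
  fail();
@POST
# Reached only for matches that @CHECK did not reject.
"output.txt" << N("$text", 1) << "\n";
noop();
@@POST
@RULES
_xNIL <- _xALPHA @@
```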
WARNINGS:
[1] The developer must keep in mind that actions that modify the parse tree, such as single(), will disable references to the nodes of a phrase and otherwise affect the execution of NLP++ code. It is recommended that tree-modifying actions come after any references to node variables.
[2] If the developer must reference a newly created node, e.g., one created by the group action, the developer must manually work out the new element number of the created node. Note that following an action such as group, the element numbers to the right of the grouped node will also change.
See the Pass File section on Tokens.
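The ordering advice in warning [1] can be sketched as follows: node variables are read first, and the tree-modifying action comes last. The rule, the _pair concept name, and the S variable names are hypothetical, and _xWHITE is assumed here as the whitespace token match (it is not described in this topic).

```
@NODES _ROOT
@POST
# Read node data first, while elements 1 and 3 still refer to matched nodes.
S("first") = N("$text", 1);
S("second") = N("$text", 3);
# The tree-modifying action comes last.
single();
@@POST
@RULES
_pair <- _xALPHA _xWHITE _xALPHA @@
```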
NLP++ currently supports global variables and variables associated with particular nodes of the parse tree. The syntax for these variables is as follows:
VARIABLE | DESCRIPTION |
G(varname) | Global variable. |
S(varname) | Variable belonging to the suggested concept of a rule. |
X(varname [,num]) | Variable belonging to the numth context node starting at the root of the parse tree. Usually used to refer to the numth node of the @PATH select list. With @NODES, the preferred form is X(varname). |
N(varname [,num]) | Variable belonging to a node that matched the numth element of a rule phrase. |
The shorthand notations N(varname) and X(varname) refer to the last element of the phrase and the last context node, respectively. The former is convenient for rules that have only one element in their phrase. The latter is convenient when using the @NODES selector. N(varname, 0) and X(varname, 0) have the same meaning as the shorthand forms.
There is no notion of a local variable or pointer, at this time. S variables are often used as an idiom for local variables (the difference is that they "hang around" if a new node is suggested by the current rule).
NOTE: Variable names must be quoted. E.g., N("unknown word", 2)
WARNING: X(varname,num) does NOT refer to the numth member of a @NODES list. That list is not a path in the parse tree. We recommend using X(varname) with @NODES, to avoid "blindly" traversing down the parse tree. Using X(varname,2), for example, implies that you know what the nodes immediately below the root are. Such a use should be clearly documented.
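A short sketch of the three most common variable forms follows. The variable names are hypothetical, and initializing the counter in a @CODE region ahead of @NODES is an assumption about pass file layout. With @NODES _ROOT, the shorthand X(varname) refers to the selected _ROOT node.

```
@CODE
# Start the global counter at a known value.
G("word count") = 0;
@@CODE
@NODES _ROOT
@POST
# Global variable, shared by every rule match.
G("word count") = G("word count") + 1;
# Variable on the node that matched element 1 of the phrase.
N("seen", 1) = 1;
# Variable on the last context node (shorthand form, as recommended with @NODES).
X("last word") = N("$text", 1);
noop();
@@POST
@RULES
_xNIL <- _xALPHA @@
```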
To get useful information from the parse tree and other sources when analyzing a text, some special variable names have been provided. All start with a dollar sign ($). No such variable names are usable with suggested (S) variables, at present.
VARIABLE NAME | VARIABLE TYPES | DESCRIPTION |
$treetext | N, X | Fetch the 'cleaned up' text (leading and trailing whitespace removed, separators converted to a single space) gleaned from a node's subtree. Uses the parse tree as input, not the text buffer. Analogous to $text. The difference is that $text returns text from the input text buffer, while $treetext returns text from the parse tree. So, if you have removed some nodes from a tree with excise(), $treetext returns the text minus the excised elements, while $text returns everything in the input text buffer, including the text under excised nodes. |
$text | N, X | Fetch the text covered by the node. Cleans up whitespace (leading and trailing whitespace removed, separators converted to a single space). Uses the original text buffer, rather than the subtree under the node, in order to gather text. See also $treetext and $treeraw. |
$treeraw | N, X | Fetch the text covered by the node from the parse tree. Text is not 'cleaned up'. Analogous to $treetext. See also $treetext and $raw. |
$raw | N, X | Fetch the text covered by the node from the text buffer. Text is not 'cleaned up'. See also $treetext, $treeraw and $text. |
$xmltext | N, X | Same as $raw, but converts characters that are special to HTML and XML. E.g., "<" is converted to "&lt;". |
$length | N, X | Get the length of the node's text. |
$ostart | N, X | Start offset of the referenced node in the input text. |
$oend | N, X | End offset of the referenced node in the input text. |
$start | N, X | Evaluates to 1 if the referenced node has no left sibling in the parse tree, else 0. |
$end | N, X | Evaluates to 1 if the referenced node has no right sibling in the parse tree, else 0. |
$apppath | G | Get base directory path for current application. E.g., "D:\dev\apps\Resume". |
$input | G | Get full input file. E.g., "D:\apps\Resume\input\Dev1\rez.txt" |
$inputpath | G | Get input file path. E.g., "D:\apps\Resume\input\Dev1" |
$inputname | G | Get input filename. E.g., "rez.txt" |
$inputhead | G | Get input file head. E.g., "rez" |
$inputtail | G | Get input file tail (i.e., extension). E.g., "txt" |
$allcaps | N | Returns 1 if the token underlying the node is all uppercase. If multiple words (even if all are all-caps), returns 0. WARN: Not identical to the uppercase() pre action. |
$lowercase | N | Returns 1 if the token underlying the node is all lowercase. If multiple words (even if all are lowercase), returns 0. WARN: Not identical to the lowercase() pre action. |
$cap | N | Returns 1 if the token underlying the node is a capitalized word. WARN: Not identical to the cap() pre action. |
$mixcap | N | Returns 1 if the token underlying the node is a mixed-capitalized word. Examples: "MIchigan", "abcD". |
$exists | N | Returns 1 if the node exists, 0 otherwise. |
$unknown | N | Returns 1 if the token underlying the node is an unknown word. WARN: Not identical to the unknown() pre action. Requires a lookup() code action prior to any use of this special variable. |
NOTE:
[1] Special variable names must be quoted. E.g., N("$text", 2).
[2] You should not assign to or create your own "dollar" variables. NLP++ reserves all variable names that start with a dollar sign.
[3] The $uppercase, $lowercase, $cap, and $unknown variables differ from the analogous pre actions in that they all traverse down to the "leafiest" node they can find, in order to perform their check. The corresponding pre actions only chain down if the [s] flag is present in the associated phrase element(s). Also, the $ variables are not stopped by a [base] attribute in a node.
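The sketch below reads several special variables for each word matched under the root and writes them to a log. The "output.txt" file name and the message wording are illustrative assumptions. Only special variables listed in the table above are used.

```
@NODES _ROOT
@POST
# Name of the input file being analyzed (global special variable).
"output.txt" << "File: " << G("$inputname") << "\n";
# Text, length, and offsets of the node that matched element 1.
"output.txt" << N("$text", 1) << " length=" << N("$length", 1) << "\n";
"output.txt" << "  offsets " << N("$ostart", 1) << " to " << N("$oend", 1) << "\n";
# Flag single tokens written entirely in uppercase.
if (N("$allcaps", 1))
  "output.txt" << "  (all caps)" << "\n";
noop();
@@POST
@RULES
_xNIL <- _xALPHA @@
```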
NLP++ can also reference the nodes of a parse tree directly. For example, N(2) refers to the (first) node that matched the 2nd element of a rule's right-hand-side phrase. X(3) refers to the 3rd node in the current context path as specified by @PATH. These constructs are used in functions that deal with the parse tree (see the Parse Tree Functions section below). (Note that knowledge base nodes are unrelated to parse tree nodes. The former are called NODE and the latter PNODE, as in the Types section below.)
The types of NLP++ variables available are shown in the following table.
NAME | ABBREVIATION | CHARACTERISTICS |
string | STR | Alpha-numerics: e.g. G("foobar") = "w3c"; |
integer | INT | 32-bit integers: e.g. G("foobar") = 32776; |
boolean | BOOL | 1 or 0 (Boolean implemented as a numeric 1 for true and the numeric 0 for false.) |
ostream | OSTREAM | C++ output stream. fileout(filename) creates a variable with an ostream value, and filename << expr; prints by referring to the ostream value of G(filename). See also the ofile NLP++ function. |
concept | CON | Concepts are the main object in the kb and appear in the kb hierarchy. They have string names, and optionally have one or more attributes. e.g. G("myConcept") = makeconcept(G("root"),"My Name"); |
attribute | ATTR | Attributes are associated with concepts and kb nodes. They have string names, and optionally have one or more values. e.g. G("myAttribute") = addattr(G("myConcept"),"is"); |
value | VAL | Values are the values of kb attributes. They can be strings, integers or concepts: e.g. addconval(G("myConcept"), "is", "John Smith"); |
phrase | PHR | A phrase is a sequence of nodes owned by a single concept. E.g., create the first node of a phrase: G("USPhrase") = makephrase(G("US"),"United"); Add the last node of the phrase: addcnode(G("US"), "States"); |
node | NODE | Node in the KB. Differs from a concept in that it is not in the KB hierarchy, but rather in a concept's phrase. A node can serve as a "proxy" for a concept. |
array | ARRAY | The values of an array, or multi-valued variable, can be any single value (i.e., any type other than a nested ARRAY). |
pnode | PNODE | Node in the parse tree. Unrelated to a KB node. |
Arrays can consist of combinations of simple types. Nested array types are not implemented. An example follows.
# Create an array and step through it.
G("counter") = 0;
G("array_size") = 10;
# Arrays can optionally be "declared" by padding them with 0.
G("int_array")[G("array_size") - 1] = 0;
while ( G("counter") < G("array_size") ) {
  G("int_array")[G("counter")] = G("counter");
  "output.txt" << G("int_array")[G("counter")] << "\n";
  G("counter")++;
}
# Print it out all at once:
"output.txt" << G("int_array") << "\n";
# Copy it all at once:
G("another_array") = G("int_array");
Most NLP++ operators are modeled on the C programming language. The main addition is the confidence operator, %%. This enables convenient accumulation of evidence in terms of percent confidence. Following is an operator precedence table, with highest precedence at the top.
OPERATOR | DESCRIPTION | ASSOCIATIVITY |
++ -- | Post increment, decrement | Left to right |
++ -- | Pre increment, decrement | Right to left |
! | Logical NOT (unary) | Right to left |
+ - | Unary plus, minus | Right to left |
% * / %% | Remainder, multiplication, division, confidence | Left to right |
+ - | Addition, subtraction | Left to right |
< > <= == != >= | Relational operators | Left to right |
&& || | Logical AND, OR | Left to right |
= | Assignment | Right to left (multiple assignment works) |
*= /= += -= %%= | Shorthand assignment (unimplemented) | Right to left |
<< | Output operator | Left to right |
The << output operator is modeled on the C++ operator. Only the following form is supported, currently:
filename << expression << expression .... ;
Example:
"output.txt" << "Hello, NLP++!" << "\n";
NOTES: The file named by filename should have been created by a prior fileout(filename) code action. If that hasn't been done, then the output operator will create the global variable G(filename) with the output stream as its value. The default output mode for the file is append, which can't be modified by the user at present. Also, the default output directory is the log directory for the current input file -- also not under user control at present.
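As a sketch of the recommended order, the fragment below creates the stream with fileout before the first write. The file name "names.txt" is hypothetical, and placing the call in a @CODE region is an assumption about where a one-time setup step would normally go.

```
@CODE
# Create the output stream explicitly before writing to it.
fileout("names.txt");
# Write through the stream held in G("names.txt").
"names.txt" << "Names found:" << "\n";
@@CODE
```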
Say that something in the parse tree gives you 80% confidence that you've found a name. You could write this into a global variable as
G("name confidence") = 80;
Now you want to account for a new piece of evidence. Say the standalone confidence of that evidence is 70%. You could combine this as
G("name confidence") = G("name confidence") %% 70;
which will yield a new percent confidence between 80% and 100%.
The %% operator guarantees that "adding" or "subtracting" confidence will always stay within the range of -100 to +100, providing an intuitive way to gather evidence and compare confidences using a single scale of measurement. (Actually, %% bottoms out at 0%. A planned change to the operator will enable it to bottom out at -100%, which can be thought of as 100% confidence against something.)
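The sketch below accumulates evidence for a single matched word. The two tests and their standalone confidences (80 and 70) are made-up heuristics chosen only to illustrate the operator. The long form of assignment is used because the %%= shorthand is listed as unimplemented.

```
@NODES _ROOT
@POST
G("name confidence") = 0;
# Evidence 1: the word is capitalized.
if (N("$cap", 1))
  G("name confidence") = G("name confidence") %% 80;
# Evidence 2: the word is not the first node in its list.
if (!N("$start", 1))
  G("name confidence") = G("name confidence") %% 70;
"output.txt" << N("$text", 1) << ": " << G("name confidence") << "\n";
noop();
@@POST
@RULES
_xNIL <- _xALPHA @@
```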
NLP++ expressions are modeled on the C/C++ programming languages and can be nested arbitrarily and grouped using parentheses ( ).
One useful enhancement: addition with string arguments or string variables performs string concatenation.
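A brief sketch of string addition follows. The variable names and values are illustrative. Converting the number with str() before joining it to a string is a cautious choice here, since this topic does not say how mixed string and numeric operands are handled.

```
G("head") = "rez";
G("tail") = "txt";
# Addition on strings joins them end to end.
G("name") = G("head") + "." + G("tail");
"output.txt" << G("name") << "\n";
# Convert a number to a string before joining it to other text.
G("label") = "Count: " + str(10);
"output.txt" << G("label") << "\n";
```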
NLP++ supports function calls. Actions, e.g., single, are specialized functions that don't take generalized NLP++ expressions as arguments (see Pass Files in the Developers section). Currently, new NLP++ functions must be defined with C++ code, within an application's user project. User-defined functions must be preceded by the 'user' scope qualifier, e.g., user::myfunction. (Interpreted NLP++ will find the correct function and issue a warning, but user functions without the scope qualifier will be compiled incorrectly.)
Statement syntax, too, is modeled after the C programming language. Curly braces { and } delimit blocks of statements. Statements should end with a semicolon (;). A statement can be an arbitrary expression, a block, or fully nested if-else statements.
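A small sketch of nested if-else statements with blocks is given below. The threshold values and messages are arbitrary, and the C-style placement of else is assumed from the statement that the syntax follows C.

```
G("counter") = 7;
if (G("counter") < 5) {
  "output.txt" << "small" << "\n";
}
else if (G("counter") < 10) {
  "output.txt" << "medium" << "\n";
}
else {
  "output.txt" << "large" << "\n";
}
```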
The only loop construct currently supported is the while loop. Here's an example that prints out 0 through 9:
G("counter") = 0;
G("loop_limit") = 10;
while ( G("counter") < G("loop_limit") ) {
  "output.txt" << "Loop counter = " << G("counter") << "\n";
  G("counter")++;
}
FUNCTION NAME | RETURNS | DESCRIPTION |
arraylength(VAR) | INT | Return number of values in given variable or expr. (Like a call-by-reference) |
lasteltnode(elt_num) | PNODE | Get the last node that matched the elt_numth element of a rule. |
phrasetext() | STR | Get the text that matched the right hand side phrase of a rule. Analogous to $text. |
phraseraw() | STR | Get the raw text that matched the right hand side phrase of a rule. Analogous to $raw. |
pndeletechilds(pnode) | None | Delete children of given pnode. |
pnname(pnode) | STR | Return pnode's name. |
pnnext(pnode) | PNODE | Return pnode's right sibling, if any. |
pnprev(pnode) | PNODE | Return pnode's left sibling, if any. |
pnrename(pnode, str) | STR | Rename given pnode. Returns the interned name. |
pnroot() | PNODE | Return root of parse tree. |
pnsingletdown(pnode) | PNODE | Return pnode's child, obeying rules about going down a "singlet chain." I.e., get child only if no branching and if not going past node with BASE flag set. |
pnup(pnode) | PNODE | Return pnode's parent, if any. Only leftmost pnode in a list has a parent. |
pnvar(pnode, var_str) | (var values) | Get pnode's variable value(s). |
varinlist(var_str, elt_num) | (var values) | Get var_str variable's value(s) in any pnode that matched elt_numth element. |
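The functions in the table above can be combined with direct node references such as N(1). The sketch below logs the name of a matched node and of its right sibling. Testing a PNODE value directly in an if condition to see whether a sibling exists is an assumption, as is the "output.txt" file name.

```
@NODES _ROOT
@POST
# Keep a reference to the first node that matched element 1.
G("node") = N(1);
"output.txt" << "matched: " << pnname(G("node")) << "\n";
# Step to the right sibling, if there is one.
G("next") = pnnext(G("node"));
if (G("next"))
  "output.txt" << "next: " << pnname(G("next")) << "\n";
noop();
@@POST
@RULES
_xNIL <- _xALPHA @@
```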
FUNCTION NAME | RETURNS | DESCRIPTION |
strchar(str, num) | STR | Index to numth char of string. |
strchr(str, ch_str) | STR | Find first occurrence of char in string. Return string headed by char. |
strclean(str) | STR | Remove leading, trailing, and repeat whitespace separators. |
strcontains(str1,str2) | BOOL | Is str1 contained in str2. |
strcontainsnocase(str1,str2) | BOOL | Is str1 contained in str2, ignoring letter case. |
strendswith(str, suffix_str) | BOOL | Does given str end with given suffix_str. |
strequal(str1,str2) | BOOL | Is str1 identical to str2. |
strisalpha(str) | BOOL | Is str all alphabetic chars. |
strisdigit(str) | BOOL | Is str all numeric chars. |
strnotequal(str1,str2) | BOOL | Does str1 differ from str2. |
strequalnocase(str1,str2) | BOOL | Is str1 identical to str2, ignoring letter case. |
strnotequalnocase(str1,str2) | BOOL | Does str1 differ from str2, ignoring letter case. |
strlessthan(str1,str2) | BOOL | Is str1 lexically before str2. |
strgreaterthan(str1,str2) | BOOL | Is str1 lexically after str2. |
strlength(str) | INT | Return length of string. |
strpiece(str, start_num, end_num) | STR | Return substring from start_num to end_num indexes. Zero-based. |
strrchr(str, ch_str) | STR | Find last occurrence of char in string. Return string headed by char. |
strsubst(str, old_str, new_str) | STR | In str, substitute instances of old_str with new_str. |
strtolower(str) | STR | Return lowercase form of string. |
strtotitle(str) | STR | Capitalize string as in a title or header. |
strtoupper(str) | STR | Return uppercase form of string. |
strtrim(str) | STR | Remove leading and trailing whitespace from str. |
unpackdirs(dir_str) | STR ARRAY | Unpack directory names from a full directory path string. |
strwrap(str, num) | STR | Fold string successively at every numth character. |
num(str) | NUM | Convert string to number, if possible. (Also accepts num arg). |
str(num) | STR | Convert numeric value to string. (Also accepts string arg). |
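A few of the string functions in the table above are sketched here on a literal value. The variable names and the sample word are illustrative. No expected output is shown because, for example, the table does not say whether the end index of strpiece is inclusive.

```
G("word") = "Michigan";
# Case conversion and length.
G("lower") = strtolower(G("word"));
G("len") = strlength(G("word"));
# Substring by zero-based start and end indexes.
G("head") = strpiece(G("word"), 0, 3);
"output.txt" << G("lower") << " " << G("len") << " " << G("head") << "\n";
# Boolean tests return 1 or 0 and can be used directly in conditions.
if (strendswith(G("word"), "gan"))
  "output.txt" << "ends with gan" << "\n";
if (strequalnocase(G("word"), "MICHIGAN"))
  "output.txt" << "same word, ignoring case" << "\n";
```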
FUNCTION NAME | RETURNS | DESCRIPTION |
fail() | None | In @CHECK region, abort the rule that has just matched. Noop elsewhere. |
succeed() | None | In @CHECK and @CODE region, succeed without performing further code actions in that region. |
batchstart() | BOOL | True if current file is first in a batch run. |
debug() | BOOL | No op. Convenient breakpoint for debugging from C++ compiler. |
hitconf(hits_num, total_num, fudge_num) | 0 to 100 | Percent confidence for hits. Fudge factor is typically 3 to 20. |
LJ(num, fieldsize_num) | STR | Left-justify num in given field size. |
ofile(path_str [, mode_str [, mode_str [, mode_str]]] ) | OSTREAM | Open an output file for the given path; analogous to C++ open file function. Zero to three mode strings selected from: 1) "app" = append; 2) "ate" = special C++ append; 3) "binary" = open in binary mode (text is default). WARNINGS: Default mode is OVERWRITE, not APPEND. Unlike fileout(), does not create a global variable for the file stream. [Upgraded doc to conform to standard C++ i/o libraries.] |
percentstr(numerator_num, denominator_num) | STR (" 0" to "100") | Format a percentage (right-justified in field of 3 chars). |
rightjustifynum(num, fieldsize_num) | STR | Right-justify num in given field size. |
spellcandidates(word_str) | STR | Return list of space-separated candidates for spell-correcting given word. |
spellcorrect(word_str) | STR | If known word, return as-is. If not, return best spelling guess. |
spellword(word_str) | BOOL | See if given word is a known English word. |
strspellcandidate(word_str, list_str) | STR | Choose best spell-correct candidate for a word, given a list of space-separated candidates (as returned by spellcandidates, e.g.) |
strspellcompare(str1, str2) | INT | Return spelling "distance" between two given words. The smaller the number, the fewer the corrections to get from one to the other. |
system(str) | BOOL | Execute str as operating system command. |
Note: It is necessary to escape the directory delimiter backslash ('\') with another ('\'). For example, suppose you wanted to echo characters to a file and then copy that file to some other directory. Here's some example code:
@NODES _ROOT
@POST
# echo.txt is put in your default console location, which for
# me happens to be D:\WINNT\Profiles\Administrator\Desktop
system("echo this is a test > echo.txt");
# echo1.txt ends up in D:\WINNT\Profiles\Administrator\Desktop
system("copy echo.txt echo1.txt");
# echo2.txt ends up on D:
system("copy echo1.txt D:\\echo2.txt");
# echo3.txt ends up on D:\apps
system("copy D:\\echo2.txt D:\\apps\\echo3.txt");
"output.txt" << "copy.\n" ;
noop();
@@POST
@RULES
_xNIL <- _xALPHA @@
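Other entries in the function table can be used the same way inside a @POST region. As one more sketch, the pass below logs a suggested correction for each word that spellword does not recognize. The "output.txt" file name and the log format are illustrative assumptions.

```
@NODES _ROOT
@POST
# Log a suggested correction for words the spell checker does not know.
if (!spellword(N("$text", 1)))
  "output.txt" << N("$text", 1) << " -> " << spellcorrect(N("$text", 1)) << "\n";
noop();
@@POST
@RULES
_xNIL <- _xALPHA @@
```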