Masks and wildcards for extracting data

Last Updated: Nov 16, 2022

You can split a message into fields by position, using masks and wildcards to extract the data you need.

Mask characters

Characters used to specify masks.

Character	Matches
?	Any single character.
*	Zero or more characters.
#	Any single digit (0 – 9).
[character string]	Any single character in character string. Must be enclosed in square brackets.
[!character string]	Any single character not in character string. Must be enclosed in square brackets.
( )	Indicates the data to be extracted into the field.
\	Escape character. To match a character that is otherwise used as a filter token, for example a question mark, precede the character with a backslash.
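
To make the table concrete, the following Python sketch translates a mask into a regular expression. It is an illustration, not a description of how the interface is implemented: the function names mask_to_regex and extract are hypothetical, and the sketch assumes that * matches as few characters as possible and that a mask must match the whole line. It takes only the mask text, not the surrounding FIELD(n) = [ ... ] definition.

```python
import re


def mask_to_regex(mask: str) -> str:
    """Translate mask text into a Python regex (illustrative assumption only)."""
    out = []
    i = 0
    while i < len(mask):
        ch = mask[i]
        if ch == "\\" and i + 1 < len(mask):
            # A backslash makes the next character match literally.
            out.append(re.escape(mask[i + 1]))
            i += 2
            continue
        if ch == "?":
            out.append(".")          # any single character
        elif ch == "*":
            out.append(".*?")        # zero or more characters (assumed non-greedy)
        elif ch == "#":
            out.append("[0-9]")      # any single digit
        elif ch in "()":
            out.append(ch)           # parentheses mark the data to extract
        elif ch == "[":
            end = mask.index("]", i)  # raises ValueError if the bracket is not closed
            body = mask[i + 1:end]
            negate = body.startswith("!")
            if negate:
                body = body[1:]
            out.append("[" + ("^" if negate else "") + re.escape(body) + "]")
            i = end + 1
            continue
        else:
            out.append(re.escape(ch))  # ordinary character
        i += 1
    return "^" + "".join(out) + "$"


def extract(mask: str, line: str):
    """Return the text captured by ( ), or None if the mask does not match."""
    match = re.match(mask_to_regex(mask), line)
    if match and match.re.groups >= 1:
        return match.group(1)
    return None
```

For example, extract("ID-(###)*", "ID-042 ok") captures the three digits after the prefix, and extract("what\\?(*)", "what?yes") treats the escaped question mark as a literal character.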

To extract data based on starting and ending position, specify the range using the format Cx – Cy. For more flexibility, you can use masks and wildcards in conjunction with position specifiers.

Character extraction examples

Examples of extracting data from messages into fields.

Example	Description
FIELD(1) = C1 – C10	Extract the first ten characters from the input line.
FIELD(2) = C11 – C11(",")	Extract the field that starts at character 11 and ends before the next comma.
FIELD(3) = C11(",") – (",")	Extract the field that starts after the first comma after character 11 and ends before the next comma.
FIELD(4) = C31 – C41("[;,:]")	Extract the characters starting at position 31 up to (but not including) the first semicolon, comma, or colon after position 41.
FIELD(5) = C51 – C51("[!0123456789]")	Extract characters starting at position 51 up to (but not including) the first non-numeric character after position 51.
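
The position specifiers can be pictured with ordinary string slicing. The Python sketch below mimics three kinds of entries from the table on a short sample line. The helper names (fixed, take_until, between) and the sample data are hypothetical, and the sketch assumes that positions are 1-based, that a search such as C11(",") starts at the given position itself, and that a missing separator yields no result.

```python
def fixed(line: str, start: int, end: int) -> str:
    """Cx - Cy: characters from 1-based position start through end, inclusive."""
    return line[start - 1:end]


def take_until(line: str, start: int, is_stop) -> str:
    """Cx - Cx(mask): characters from start up to, not including, the first stop character."""
    begin = start - 1
    for i in range(begin, len(line)):
        if is_stop(line[i]):
            return line[begin:i]
    return line[begin:]  # assumption: no stop character means take the rest


def between(line: str, start: int, sep: str):
    """Cx(sep) - (sep): text between the first sep at or after start and the next sep."""
    first = line.find(sep, start - 1)
    if first == -1:
        return None
    second = line.find(sep, first + 1)
    if second == -1:
        return None  # assumption: both separators must be present
    return line[first + 1:second]


line = "2022-11-16temp,21.5,degC"

field1 = fixed(line, 1, 10)                          # like FIELD(1) = C1 - C10
field2 = take_until(line, 11, lambda c: c == ",")    # like FIELD(2) = C11 - C11(",")
field3 = between(line, 11, ",")                      # like FIELD(3) = C11(",") - (",")

# Like FIELD(5), but starting at position 4 of a shorter line:
digits = take_until("abc12345xyz", 4, lambda c: c not in "0123456789")
```

Here field1 holds the date, field2 the tag name, field3 the value between the two commas, and digits the run of numeric characters.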

For input arranged in rows and columns, such as .csv files, the simplest way to extract and assign values to individual fields without specifying fixed numeric start and end positions is to define one mask per column, as shown in the following structure. In this example, the separator is the semicolon and the construct expects exactly three columns in the input file. The asterisk in parentheses, (*), denotes which part of the string is assigned to the field on the left side of the construct.

FIELD(1) = [(*);*;*]
FIELD(2) = [*;(*);*]
FIELD(3) = [*;*;(*)]
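
As a rough illustration of what these three definitions select, the Python sketch below applies regular expressions that play the role of the three masks to one semicolon-separated line. The regular expressions, the split_fields helper, and the sample line are assumptions made for this example. In particular, the sketch assumes that each * matches as few characters as possible and that the mask must match the whole line.

```python
import re

# Regex stand-ins for the three masks (assumed equivalents, not UFL syntax).
PATTERNS = {
    "FIELD(1)": r"^(.*?);.*?;.*$",   # [(*);*;*]
    "FIELD(2)": r"^.*?;(.*?);.*$",   # [*;(*);*]
    "FIELD(3)": r"^.*?;.*?;(.*)$",   # [*;*;(*)]
}


def split_fields(line: str) -> dict:
    """Return the captured text for each field, or None where the mask fails."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.match(pattern, line)
        fields[name] = match.group(1) if match else None
    return fields


result = split_fields("Tank1;2022-11-16 08:00:00;42.7")
```

With the sample line, result maps FIELD(1) to the tag name, FIELD(2) to the timestamp, and FIELD(3) to the value. A line with fewer than two separators leaves all three fields empty in this sketch.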

The white space characters, space and tab, can be used as separators as well. For example:

FIELD(1) = ["(*) * *"]
FIELD(2) = ["* (*) *"]
FIELD(3) = ["* * (*)"]

In cases where the .csv file contains many separators, you can reduce the effort of defining the matrix by using a diagonal matrix. For example:

FIELD(1) = [(*);*]
FIELD(2) = [*;(*);*]
FIELD(3) = [*;*;(*);*]
FIELD(4) = [*;*;*;(*);*]

The final asterisk in each field definition matches everything after the extracted field, including any remaining separators, which helps avoid errors caused by missed separators.
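
Because the diagonal pattern is regular, the definitions can be generated rather than typed by hand. The following Python sketch builds the mask for a given column. The helper names column_mask and field_definitions are hypothetical and only produce text in the form shown above. They do not call the interface.

```python
def column_mask(column: int, separator: str = ";") -> str:
    """Build the diagonal-matrix mask that extracts the given 1-based column."""
    if column < 1:
        raise ValueError("column numbers start at 1")
    # Skip the preceding columns, capture this one, then absorb the rest.
    parts = ["*"] * (column - 1) + ["(*)", "*"]
    return "[" + separator.join(parts) + "]"


def field_definitions(columns: int, separator: str = ";") -> list:
    """Return one FIELD(n) definition line per column."""
    return [
        f"FIELD({n}) = {column_mask(n, separator)}"
        for n in range(1, columns + 1)
    ]


for definition in field_definitions(4):
    print(definition)
```

Running the loop prints the four definitions from the example above, one per line.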

Escape character

An escape character invokes an alternative interpretation of the characters that follow it in a character sequence. UFL uses the \ (backslash) as its escape character. For example, to extract a double-quoted field and strip the quotes, use a backslash to escape each quote:

FIELD(1) = ["*,*,*\"(*)\""]

To escape the asterisk, use the following construct:

FIELD(1) = ["*,*\*(*)\**"]
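
When literal text has to be placed inside a mask, each special character needs its own backslash. The Python sketch below shows one way to prepare such text. The helper name escape_mask_literal is hypothetical, and the set of characters it escapes is an assumption based on the mask characters listed in this topic plus the double quote used in the example above.

```python
# Characters assumed to need a backslash when they must match literally.
MASK_SPECIALS = set('?*#[]()\\"')


def escape_mask_literal(text: str) -> str:
    """Prefix every special mask character in text with a backslash."""
    return "".join("\\" + ch if ch in MASK_SPECIALS else ch for ch in text)


escaped = escape_mask_literal("5*3?")
```

The value of escaped contains a backslash before the asterisk and before the question mark, so both characters would be matched literally rather than treated as wildcards.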

PI Interface for Universal File and Stream Loading UFL