Documents

Last Updated: Oct 31, 2025

The Gateway can extract several types of input files, such as .pdf, .txt, .rtf, .xls, .xlsx, .doc, and .docx files.

The Extract component can read data from File System or S3 Bucket sources.

The details of the input storage to be read by the project are defined in the extract configuration.

To specify how to extract the data:

Click Extract to display the Extract page.
Select one or more file formats from the Document Type drop-down box:
- Word (.doc and .docx)
- PDF (extensions containing "pdf")
- Text (.txt)
- RTF (.rtf)
- Excel (.xls and .xlsx).
  
  For .xlsx files, only the Excel Workbook format is supported. The strict OOXML format (which also uses the .xlsx extension) is not supported. To convert a strict OOXML file, open it in Excel and save it in Excel Workbook format.
Select the type of Input Source from the drop-down box, that is, File System or S3.

Note: Only the text contents of Excel files are extracted. If you need to process the contents as a table, see the Excel topic.
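The strict OOXML restriction on .xlsx files can be checked before files are placed in the input location. The following is a minimal sketch in Python, not part of the Gateway. It assumes the workbook part is stored at its conventional location, xl/workbook.xml, and is UTF-8 encoded, as Excel writes it. The function name is hypothetical. Strict files declare the ISO/IEC 29500 Strict spreadsheet namespace, while Excel Workbook (transitional) files declare the schemas.openxmlformats.org namespace.

```python
import zipfile

# Namespace declared by strict OOXML workbooks (ISO/IEC 29500 Strict).
STRICT_NS = "http://purl.oclc.org/ooxml/spreadsheetml/main"
# Namespace declared by regular Excel Workbook (transitional) files.
TRANSITIONAL_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def is_strict_ooxml(xlsx):
    """Return True if the .xlsx file (path or file-like object) is strict OOXML.

    Assumption: the workbook part is at xl/workbook.xml and is UTF-8 encoded.
    """
    with zipfile.ZipFile(xlsx) as archive:
        workbook_xml = archive.read("xl/workbook.xml").decode("utf-8", errors="replace")
    return STRICT_NS in workbook_xml
```

A file for which this returns True would need to be re-saved in Excel Workbook format before extraction.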

For File System:

Define the Input Path for the location of the input files, either by typing it directly or by using Browse File Path to select a single file or Browse Folder Path to select a folder of files.

Extract contents By

Select an Extract contents By option to specify how the text in the input files is stored in the Object Model for subsequent transformation: one word per object, one line per object, or the whole text as a single object:
- Word
- Line
- Whole Text (selected by default)
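The effect of the three options can be pictured with a short Python sketch. This is an illustration only, not the Gateway's implementation: the function name is hypothetical, and the splitting rules (words separated by whitespace, blank lines skipped) are assumptions.

```python
def split_text(text, extract_by="Whole Text"):
    """Split extracted text into the strings that would each become one object."""
    if extract_by == "Word":
        # Assumption: words are separated by any run of whitespace.
        return text.split()
    if extract_by == "Line":
        # Assumption: blank lines do not produce objects.
        return [line for line in text.splitlines() if line.strip()]
    if extract_by == "Whole Text":
        return [text]
    raise ValueError("Unknown option: %r" % extract_by)
```

For a two-line input, Word yields one string per word, Line yields two strings, and Whole Text yields a single string containing the entire input.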

Note: If you select a folder of documents to process, the Gateway generates one output EIWM file for each input file in the folder or its subfolders. Each EIWM file has the same name as its input file.

Document extraction for Document Type "Word"

The "Word" Document Type supports processing of .doc and .docx files as input. The Gateway requires a separate installation of an Index Filter (iFilter) to process these files. Ensure that the relevant iFilters installed with the Microsoft Office Filter Pack or Microsoft Office are present for full document format support:

OffFilt.dll → Used for extracting text from .doc files.
OFFFILTX.DLL → Used for extracting text from .docx files.

When installed, these are usually located at:

- C:\Program Files\Common Files\Microsoft Shared\Filters (64-bit, from the Filter Pack)
- C:\Program Files (x86)\Common Files\Microsoft Shared\Filters (32-bit, from the Filter Pack)
- C:\Program Files\Microsoft Office\root\VFS\ProgramFilesCommonX64\Microsoft Shared\Filters (from Office)

If these are missing, text extraction of .doc and/or .docx files will not work.
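A quick way to confirm whether these files are present is to look for them in the folders listed above. The Python sketch below is a convenience check under stated assumptions, not a Gateway tool: the function name is hypothetical, it searches only the three folders named in this topic, and finding a DLL on disk does not prove that the iFilter is registered with Windows.

```python
import os

# Folders and file names taken from the text above.
FILTER_DIRS = [
    r"C:\Program Files\Common Files\Microsoft Shared\Filters",
    r"C:\Program Files (x86)\Common Files\Microsoft Shared\Filters",
    r"C:\Program Files\Microsoft Office\root\VFS\ProgramFilesCommonX64\Microsoft Shared\Filters",
]
IFILTER_DLLS = {".doc": "OffFilt.dll", ".docx": "OFFFILTX.DLL"}


def find_ifilters(search_dirs=FILTER_DIRS, dlls=IFILTER_DLLS):
    """Return {file extension: [paths where the matching iFilter DLL exists]}."""
    found = {}
    for extension, dll_name in dlls.items():
        candidates = [os.path.join(folder, dll_name) for folder in search_dirs]
        found[extension] = [path for path in candidates if os.path.isfile(path)]
    return found
```

An empty list for an extension suggests that the corresponding iFilter is not installed in any of the usual locations.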

Notes:

.doc Files (Word 97–2003)
- Older Windows or Office versions often included OffFilt.dll (usually located at C:\Windows\System32), and it may remain even after Office is uninstalled.
.docx Files (Word 2007 and later)
- Windows Server 2022 and older versions of Windows 10 include WordPad (usually located at C:\Program Files\Windows NT\Accessories) which comes with its own iFilter: WordpadFilter.dll. This provides basic text extraction support for Office Open XML formats like .docx. As a result, text extraction from .docx files may work "out of the box" on these systems.
- Newer Windows versions (such as Windows 11 24H2 and later) do not include WordPad or its filter, so the iFilters listed above must be installed.

Microsoft Office Filter Pack Download

The Gateway is compatible with the 64-bit and 32-bit versions of iFilters available in the Microsoft Office Filter Pack:

Base package:
- 64-bit: FilterPack64bit.exe
- 32-bit: FilterPack32bit.exe
Service Pack 2 (recommended):
- 64-bit: KB2687447
- 32-bit: KB2687447

Extracting Word/Line/Whole Text Positions During PDF File Extraction

When you extract a PDF file, you can also extract the X and Y positions of each word, each line, or the whole text, depending on your selection (see Object Merge Mapping for more information). These positions are stored as X and Y extent attributes of each object in the Object Model:

#Xmin#
#Ymin#
#Xmax#
#Ymax#

These attributes are represented in the output EIWM as characteristics of the text object.

Use iFilter to Extract Embedded Excel Spreadsheet Contents from Word or PDF Files

To extract the contents of an embedded Excel spreadsheet in a Word or PDF file, install the relevant iFilter (for Word or PDF files) that is compatible with your server's operating system, and set the apply attribute of the IFilter element to "true" in the Extract configuration XML file. The default value is "false" because the default mode allows single words or lines of text to be searched and the X and Y values of the tag locations to be recorded, neither of which is available in iFilter mode. The Type attribute specifies the document types to process using the iFilter when apply is set to "true". Only Word and PDF are valid document types.

    <DocumentFile Type="Word,PDF,Text">
      <Input source="S3">
        <S3>
          <Authentication instance="false">
            <CredentialFile path="C:\Users\user_name\.aws\credentials" profileName="3pg-dev" />
          </Authentication>
          <Region>eu-west-1</Region>
          <BucketName>1ddatainput</BucketName>
          <ObjectKey />
        </S3>
      </Input>
      <ExtractBy>Word</ExtractBy>
      <IFilter apply="false" Type="PDF,Word" />
    </DocumentFile>
  </Datasources>
</configuration>
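If the Extract configuration is edited by script rather than by hand, the IFilter attributes can be set with Python's standard xml.etree.ElementTree module. The sketch below is an assumption-laden illustration: the function name is hypothetical, it operates on a DocumentFile fragment like the one shown above rather than on a complete configuration file, and it checks only the rule stated in this topic that Type may contain Word and PDF.

```python
import xml.etree.ElementTree as ET

# Per this topic, only Word and PDF documents can be processed with the iFilter.
ALLOWED_IFILTER_TYPES = {"Word", "PDF"}


def enable_ifilter(document_file_xml, types):
    """Return the XML fragment with IFilter apply="true" for the given types."""
    unsupported = [name for name in types if name not in ALLOWED_IFILTER_TYPES]
    if unsupported:
        raise ValueError("iFilter supports only Word and PDF, got: %s" % unsupported)
    root = ET.fromstring(document_file_xml)
    ifilter = root.find(".//IFilter")
    if ifilter is None:
        # Assumption: IFilter is a direct child of the fragment's root element.
        ifilter = ET.SubElement(root, "IFilter")
    ifilter.set("apply", "true")
    ifilter.set("Type", ",".join(types))
    return ET.tostring(root, encoding="unicode")
```

The limitations of iFilter mode described in the next section apply once apply is set to "true".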

Limitations of iFilter

If a PDF file contains tables, the iFilter does not separate the columns with spaces, so the extracted table content runs together.
If a PDF file contains multiple pages, the extracted text does not indicate where one page ends and the next begins.
Extracted text from a PDF file is represented as a single object, so you can only extract contents by "Whole Text".
The X and Y positions of the text cannot be determined, so the Xmin, Xmax, Ymin, and Ymax values are not available in the EIWM objects.

For S3

An S3 Bucket is a container for objects stored in Amazon Web Services (AWS) S3.

To configure the extractor for an S3 Bucket as input source:

Define the elements in the S3 Credential Details section, following the procedures in Accessing an AWS S3 Bucket.
To test the connection to the S3 Bucket, click Test Connection. The result of the test is displayed in the bottom-left status bar.
Enter an S3 Bucket Object Key that suits the number of document types selected in Document Type:
- Single Document Selection: the Object Key can be a folder (for example, S3/Input) or a file name (for example, S3/Input/sampleword.doc) whose extension matches the selected document type.
- Multiple Documents Selection: the Object Key must be a folder (for example, S3/Input). An object key with a file extension is not allowed.
Click Save Settings.
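The Object Key rules for single and multiple document type selections can be expressed as a small check. The Python sketch below is an illustration, not Gateway code: the function names are hypothetical, the extension lists come from the Document Type list earlier in this topic, and treating any key without a file extension as a folder is an assumption.

```python
import posixpath

# Extensions per document type, from the Document Type list in this topic.
TYPE_EXTENSIONS = {
    "Word": {".doc", ".docx"},
    "Text": {".txt"},
    "RTF": {".rtf"},
    "Excel": {".xls", ".xlsx"},
}


def extension_matches(extension, document_type):
    """Check a lowercase extension such as ".doc" against one document type."""
    if document_type == "PDF":
        # The topic describes PDF as any extension containing "pdf".
        return "pdf" in extension
    return extension in TYPE_EXTENSIONS.get(document_type, set())


def object_key_allowed(object_key, document_types):
    """Apply the single and multiple selection rules to an S3 object key."""
    extension = posixpath.splitext(object_key)[1].lower()
    if not extension:
        # Assumption: a key without an extension refers to a folder.
        return True
    if len(document_types) != 1:
        # With several document types selected, only folders are allowed.
        return False
    return extension_matches(extension, document_types[0])
```

For example, S3/Input/sampleword.doc passes when only Word is selected, but not when Word and PDF are selected together.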

Configuration

<configuration xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" sourceProductName="AVEVA Gateway for 1D Data" componentName="Extract" componentVersion="2.8.0.0">
  <Datasources Type="Word,PDF,Text">
    <Input source="FileSystem">
      <S3>
        <Authentication instanceProfile="false">
          <CredentialFile path=" " profile=" " />
        </Authentication>
        <Region> </Region>
        <BucketName> </BucketName>
        <ObjectKey />
      </S3>
      <FileSystem>
        <InputPath>C:\Users\<Gateway User>\Desktop\review\Input\</InputPath>
      </FileSystem>
    </Input>
    <ExtractBy>Line</ExtractBy>
  </Datasources>
</configuration>

The Gateway extracts the configuration file containing a specific datasource from the S3 repository.