Kryloff Technologies KT Text Filters

Buy a license to use KT Text Filters

KT Text Filters extract plain and Unicode text from numerous file types, such as DOC, RTF, HTM, HTML, PDF, XLS, XML, PPT, HLP, and TXT. Text extraction is normally required by information retrieval search engines, such as our SSScanner™. Text Filters can function as software components in both Kryloff's and third parties' products*; information about obtaining a license to use KT Text Filters in your software products is provided below.

Kryloff Technologies ships its products with the KT Text Filters that are available at the time of purchase. As soon as a new filter appears, we make it available for download so that our products search better and cover more file types. See the Free Stuff page to check whether you already have all the Text Filters that are currently available. Additionally, advanced users may configure SSScanner™, SSSleuth™, SSPad™, SSSiter™ and SSSpider™ to search through virtually any file. This document explains how to create custom Text Filters of your own.

*Important: KT Text Filters are provided as free components for Kryloff's products only!
If you want to use them in your own software products, you must obtain a license. The license grants the right to use our filters in your software products and to distribute them along with those products worldwide on a royalty-free basis. Upon purchasing a license, you will also receive full documentation disclosing additional capabilities of the filters that are not documented on this page, as well as sample code in C++, C#, Delphi for Win32 and .NET, and Visual Basic for Windows and for .NET demonstrating the use of our filters. You will also be enrolled in our technical support, which, in particular, includes possible bug fixes. If you have not obtained a license from us, you may not use or distribute any of the filters, nor may you count on our technical support. See also: KT Text Filters End-User License Agreement.

Table of Contents
  1. Introduction
  2. Programming and Installing Text Filters

1. Introduction

This document explains how KT Text Filters are used by, and communicate with, Subject Search Scanner™, one of the Kryloff Technologies software products. Other Kryloff products also use filters and communicate with them in approximately the same way.

Subject Search Scanner™ has been developed to search through files of different types, such as Text Files (TXT), HTML Files (HTM), Rich Text Files (RTF) and others. These file types use different storage formats: along with textual data, some of them may contain images or sounds, or may even store encrypted data. Because the SSScanner smart search engine can process textual data only, Kryloff Technologies ships it (and the rest of its products) with a number of KT Text Filters. The primary purpose of a Text Filter is to extract the textual data stored in a source file and pass it on to SSScanner (i.e., to the calling application, process or thread). Upon receiving data from the filter, SSScanner analyses it, then selects and reports the most relevant quotations. So, from SSScanner's point of view, the search process always reads input data from plain text files, regardless of their internal structure. The actual data flow is shown below:

Pic. 1. Data Flow in SSScanner

Because companies and individuals keep their data in many different formats, you may find that Kryloff Technologies has not included the Text Filter you need in the original shipment, so that SSScanner does not process some of your files correctly. Even in this case SSScanner is your solution, as it is fully configurable to search through any files!

We recommend that you first look carefully at the list of filters shown on the Text Filters page of SSScanner:

Pic. 2. Text Filters in SSScanner

If the required text filter is in the list, make sure it is enabled (the Action column indicates "Apply"). If files of your type are not processed by any of the filters, visit the SSScanner product page to check whether the filter you need has been developed by Kryloff Technologies or a third party and is available as a free download. Once you have obtained a filter, simply place its file(s) into the "Filters" subfolder of the SSScanner root folder, then restart SSScanner. The filter should then appear in the list, and all that remains is to make sure it is enabled. SSScanner keeps your settings for subsequent sessions.

Finally, you may develop a text filter yourself and use it as a "plug-in" for SSScanner. The next chapter describes how to program text filters. If you still need a new filter, or you have a filter of your own that you would like to share with other users of SSScanner, please contact us.

2. Programming and Installing Text Filters

A text filter is a dynamic-link library (DLL) that exports the following four functions:

  • GetFilterExtensions() informs SSScanner about supported file extensions;
  • CreateFilteredStream() creates the output data stream from which SSScanner reads data;
  • ReadFilteredStream() reads blocks of data from the source file and puts filtered data to the output buffer;
  • CloseFilteredStream() closes the output stream.

SSScanner uses text filters in the following four cases:

  1. At startup, to find the supported file extensions: SSScanner loads the filter into memory via LoadLibrary(), calls GetFilterExtensions, and releases the filter via FreeLibrary();
  2. When SSScanner scans files looking for a Search Phrase;
  3. While generating and updating Reports;
  4. When SSScanner pops up a Look-up window.

In the last three cases SSScanner loads the filter, calls CreateFilteredStream, then repeatedly calls ReadFilteredStream until the filtered data is exhausted or SSScanner needs no more data. Finally, SSScanner calls CloseFilteredStream to indicate that it has finished reading data, after which SSScanner may unload the filter from memory:

Interaction between SSScanner and Filter
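
The call sequence described above can be sketched from the host's side as follows. This is only an illustration, not SSScanner's actual code: DrainFilter and the Demo* functions are hypothetical names, DWORD is replaced by a portable stand-in, the WINAPI calling convention is omitted, and the callback pointers of STREAMINFO are left out. The sketch assumes that CloseFilteredStream is called even when CreateFilteredStream fails.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t DWORD;              /* stand-in for the <windows.h> type */
#define MAX_REQUEST (64u * 1024u)    /* SSScanner never asks for more than 64 KB */

typedef struct STREAMINFO {
    DWORD cbStructureSize;
    void *lpApplicationData;
    void *lpFilterData;
    /* the three callback pointers are omitted in this sketch */
} STREAMINFO;

typedef DWORD (*CREATEFUNC)(STREAMINFO *);
typedef DWORD (*READFUNC)(STREAMINFO *, void *, DWORD, DWORD *);
typedef DWORD (*CLOSEFUNC)(STREAMINFO *);

/* Hypothetical host-side driver: create, read until zero bytes, close. */
DWORD DrainFilter(CREATEFUNC createFn, READFUNC readFn, CLOSEFUNC closeFn,
                  char *out, DWORD outSize, DWORD *total)
{
    STREAMINFO si;
    DWORD err, got, want;

    assert(out != NULL && total != NULL);
    memset(&si, 0, sizeof si);
    si.cbStructureSize = (DWORD)sizeof si;
    *total = 0;

    err = createFn(&si);
    while (err == 0 && *total < outSize) {
        want = outSize - *total;
        if (want > MAX_REQUEST)
            want = MAX_REQUEST;
        got = 0;
        err = readFn(&si, out + *total, want, &got);
        if (err != 0 || got == 0)    /* error, or end of the filtered stream */
            break;
        *total += got;
    }
    closeFn(&si);                    /* lets the filter free its resources */
    return err;
}

/* Demo filter used only to exercise the driver: emits "hello" two bytes at a time. */
static const char kDemoText[] = "hello";

DWORD DemoCreate(STREAMINFO *si)
{
    si->lpFilterData = (void *)kDemoText;   /* read cursor kept in lpFilterData */
    return 0;
}

DWORD DemoRead(STREAMINFO *si, void *buf, DWORD want, DWORD *got)
{
    const char *p = (const char *)si->lpFilterData;
    DWORD n = (DWORD)strlen(p);
    if (n > 2) n = 2;
    if (n > want) n = want;
    memcpy(buf, p, n);
    si->lpFilterData = (void *)(p + n);
    *got = n;
    return 0;
}

DWORD DemoClose(STREAMINFO *si)
{
    si->lpFilterData = NULL;
    return 0;
}
```

The demo filter returns short blocks on purpose: the host must keep calling until a read reports zero bytes, not until a read comes back shorter than requested.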

To give filters access to the source file, SSScanner provides three callback functions, whose addresses are passed to the filter in the STREAMINFO structure. The callback functions are:

  • GetFileSizeFunc() retrieves the size, in bytes, of the file being read;
  • SetFilePosFunc() moves the current position in the source file and returns the new position;
  • ReadFileFunc() reads a block of data from the source file.

Function prototypes and descriptions

LPCSTR WINAPI GetFilterExtensions(void);
Returns a string containing the supported file extensions. The extensions must be returned in upper case as a sequence of null-terminated strings, and the whole sequence must end with two consecutive null characters.
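
As an illustration of this layout, the following sketch walks such a double-null-terminated list. The helper names CountExtensions and SupportsExtension are hypothetical and are not part of the filter interface; how SSScanner itself parses the list is not documented here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts the entries in a double-null-terminated list such as "HTM\0HTML\0\0". */
size_t CountExtensions(const char *list)
{
    size_t count = 0;

    assert(list != NULL);
    while (*list != '\0') {          /* an empty string marks the end of the list */
        count++;
        list += strlen(list) + 1;    /* skip the entry and its terminating null */
    }
    return count;
}

/* Returns 1 if ext (upper case, without the dot) is in the list, 0 otherwise. */
int SupportsExtension(const char *list, const char *ext)
{
    assert(list != NULL && ext != NULL);
    for (; *list != '\0'; list += strlen(list) + 1) {
        if (strcmp(list, ext) == 0)
            return 1;
    }
    return 0;
}
```

Note that a C string literal already ends with one implicit null character, so the literal "HTM\0HTML\0" is a valid two-entry list.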

DWORD WINAPI CreateFilteredStream(LPSTREAMINFO lpStreamInfo);
CreateFilteredStream is called by SSScanner before it starts reading data from the filtered stream. If required, the filter can allocate resources and save a pointer to them in lpStreamInfo->lpFilterData so that it can access them during subsequent calls related to the same file. Each call to CreateFilteredStream is followed by a call to CloseFilteredStream, in which the filter frees the allocated resources, if any. The filter can call the SetFilePosFunc and/or ReadFileFunc callback functions to access the source file. SSScanner sets the position to the beginning of the source file before calling CreateFilteredStream.
The function must return zero if successful, or a non-zero error code otherwise.
Parameters:
LPSTREAMINFO lpStreamInfo     - address of the STREAMINFO structure used by SSScanner and filters.
SSScanner creates a new instance of the structure each time it accesses a new file. An instance of the structure is created before SSScanner calls CreateFilteredStream and is destroyed after it calls CloseFilteredStream.
The STREAMINFO structure contains the following fields:
DWORD cbStructureSize     - size of structure (in bytes).
LPVOID lpApplicationData  - address of SSScanner's data which is used by callback functions. The filter should not modify it.
LPVOID lpFilterData  - the filter can keep its own data here, either as a 32-bit integer or as a pointer. SSScanner does not access this member.
LPGETFILESIZEFUNC lpGetFileSizeFunc  - address of the callback function the filter may call to retrieve the size of the source file.
LPSETFILEPOSFUNC lpSetFilePosFunc  - address of the callback function the filter may call to move the position in the source file.
LPREADFILEFUNC lpReadFileFunc  - address of the callback function the filter calls to read data from the source file.

DWORD WINAPI ReadFilteredStream(LPSTREAMINFO lpStreamInfo, LPVOID lpBuffer, DWORD dwBytesToRead, LPDWORD lpdwBytesRead);
Returns a block of data from the filtered stream. SSScanner never asks for more than 64 KB of data at a time. The filter does not have to preserve the position in the source file between subsequent calls to ReadFilteredStream.
The function must return zero if successful, or a non-zero error code otherwise.
Parameters:
LPVOID lpBuffer     - address of the output buffer, allocated by SSScanner.
DWORD dwBytesToRead  - size of data (in bytes) requested by SSScanner.
LPDWORD lpdwBytesRead  - size of data (in bytes) actually placed by the filter into the output buffer. A value of zero indicates that the end of the filtered stream has been reached and no more data is available.
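
Because a zero byte count means "end of stream", a filter that discards part of its input must not return zero bytes merely because one source block contained nothing useful. The sketch below shows one way to respect that rule, assuming a deliberately trivial filter that keeps printable ASCII and drops everything else. The types are portable stand-ins for the <windows.h> ones, WINAPI is omitted, and MEMSOURCE and MemReadFile are hypothetical test scaffolding that plays the role of SSScanner's ReadFileFunc callback.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t DWORD;              /* stand-in for the <windows.h> type */
struct STREAMINFO;
typedef DWORD (*LPREADFILEFUNC)(struct STREAMINFO *, void *, DWORD, DWORD *);

typedef struct STREAMINFO {
    DWORD cbStructureSize;
    void *lpApplicationData;
    void *lpFilterData;
    void (*lpGetFileSizeFunc)(void); /* placeholder, unused in this sketch */
    void (*lpSetFilePosFunc)(void);  /* placeholder, unused in this sketch */
    LPREADFILEFUNC lpReadFileFunc;
} STREAMINFO;

/* Keeps printable ASCII plus tab, CR and LF; drops every other byte. */
DWORD ReadFilteredStream(STREAMINFO *si, void *lpBuffer,
                         DWORD dwBytesToRead, DWORD *lpdwBytesRead)
{
    unsigned char *dst = (unsigned char *)lpBuffer;
    unsigned char chunk[256];
    DWORD i, got, want, err;

    assert(si != NULL && lpBuffer != NULL && lpdwBytesRead != NULL);
    *lpdwBytesRead = 0;
    /* Loop until at least one byte is produced or the source is exhausted. */
    while (*lpdwBytesRead == 0 && dwBytesToRead > 0) {
        /* Output never exceeds input, so reading at most dwBytesToRead is safe. */
        want = dwBytesToRead < sizeof chunk ? dwBytesToRead : (DWORD)sizeof chunk;
        got = 0;
        err = si->lpReadFileFunc(si, chunk, want, &got);
        if (err != 0)
            return err;
        if (got == 0)
            break;                   /* end of source: report zero bytes */
        for (i = 0; i < got; i++) {
            unsigned char c = chunk[i];
            if ((c >= 0x20 && c < 0x7f) || c == '\t' || c == '\r' || c == '\n')
                dst[(*lpdwBytesRead)++] = c;
        }
    }
    return 0;
}

/* Hypothetical in-memory source standing in for the host's ReadFileFunc. */
typedef struct { const unsigned char *data; DWORD size, pos; } MEMSOURCE;

DWORD MemReadFile(STREAMINFO *si, void *buf, DWORD want, DWORD *got)
{
    MEMSOURCE *src = (MEMSOURCE *)si->lpApplicationData;
    DWORD left = src->size - src->pos;
    *got = want < left ? want : left;
    memcpy(buf, src->data + src->pos, *got);
    src->pos += *got;
    return 0;
}
```

This filter is stateless, so it does not need lpFilterData. A filter that can expand its input, or that must carry parser state between calls, would keep an internal buffer there, as the RTF sample in Appendix A does.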

DWORD WINAPI CloseFilteredStream(LPSTREAMINFO lpStreamInfo);
CloseFilteredStream is called by SSScanner to inform the filter that SSScanner has finished reading filtered data. If the filter allocated any resources and stored a pointer to them in lpStreamInfo->lpFilterData, this is the time to free them.
The function must return zero if successful, or a non-zero error code otherwise.

Callback Functions

DWORD WINAPI GetFileSizeFunc(LPFILTERINFO lpFilterInfo, LPDWORD lpFileSizeLow, LPDWORD lpFileSizeHigh);
Retrieves the size, in bytes, of the source file. SSScanner identifies the file through the lpFilterInfo parameter, which must point to the same STREAMINFO structure (lpStreamInfo) that SSScanner originally passed to the filter. In SSScanner 2.7 (and later versions) the FILTERINFO structure is the same as STREAMINFO. The file size is returned in the variable pointed to by lpFileSizeLow. SSScanner v2.7 (and later versions) does not support files of 2 GB or larger and always assigns zero to the variable pointed to by lpFileSizeHigh.
The function returns zero if successful, or a non-zero error code otherwise.

DWORD WINAPI SetFilePosFunc(LPFILTERINFO lpFilterInfo, LPLONG lpDistanceToMoveLow, LPLONG lpDistanceToMoveHigh, DWORD dwMoveMethod);
Moves the current read position (pointer) in the source file and returns the new position.
The function returns zero if successful, or a non-zero error code otherwise.
Parameters:
LPFILTERINFO lpFilterInfo     - address of the STREAMINFO structure originally passed by SSScanner to the filter.
LPLONG lpDistanceToMoveLow  - points to the number of bytes by which to move the file pointer. A positive value moves the pointer forward in the file and a negative value moves it backward. On return, the same variable receives the new position of the file pointer, measured from the beginning of the file.
LPLONG lpDistanceToMoveHigh  - SSScanner v2.7 (and later versions) does not support files of 2 GB or larger. It always assigns zero to the variable pointed to by lpDistanceToMoveHigh and ignores the original value of the variable.
DWORD dwMoveMethod  - the starting point for the file pointer move. This parameter can be one of the values defined for Win32 API function SetFilePointer:
FILE_BEGIN - the starting point is the beginning of the file.
FILE_CURRENT - the current value of the file pointer is the starting point.
FILE_END - the current end-of-file position is the starting point.

DWORD WINAPI ReadFileFunc(LPFILTERINFO lpFilterInfo, LPVOID lpBuffer, DWORD dwBytesToRead, LPDWORD lpdwBytesRead);
Reads a block of data from the source file.
The function returns zero if successful, or a non-zero error code otherwise.
Parameters:
LPFILTERINFO lpFilterInfo     - address of the STREAMINFO structure originally passed by SSScanner to the filter.
LPVOID lpBuffer  - address of the output buffer (to be allocated by the filter).
DWORD dwBytesToRead  - size of data (in bytes) requested by the filter.
LPDWORD lpdwBytesRead  - size of data (in bytes) actually placed by SSScanner into the output buffer. A value of zero means that the end of the source file has been reached and no more source data is available.
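
For formats that cannot be converted in a single forward pass, a filter may prefer to load the whole source once, for example inside CreateFilteredStream, and keep the buffer in lpFilterData. The sketch below combines GetFileSizeFunc and ReadFileFunc for that purpose. LoadWholeSource is a hypothetical helper, the types are portable stand-ins for the <windows.h> ones, WINAPI is omitted, and the Mem* functions are test scaffolding. It assumes the read position is at the beginning of the file, which the text above guarantees only at the time CreateFilteredStream is called.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t DWORD;              /* stand-in for the <windows.h> type */
struct STREAMINFO;
typedef DWORD (*LPGETFILESIZEFUNC)(struct STREAMINFO *, DWORD *, DWORD *);
typedef DWORD (*LPREADFILEFUNC)(struct STREAMINFO *, void *, DWORD, DWORD *);

typedef struct STREAMINFO {
    DWORD cbStructureSize;
    void *lpApplicationData;
    void *lpFilterData;
    LPGETFILESIZEFUNC lpGetFileSizeFunc;
    void (*lpSetFilePosFunc)(void);  /* placeholder, unused in this sketch */
    LPREADFILEFUNC lpReadFileFunc;
} STREAMINFO;

/* Hypothetical helper: reads the whole source into a malloc'ed buffer.
   On success the caller owns *data and must free() it. */
DWORD LoadWholeSource(STREAMINFO *si, unsigned char **data, DWORD *size)
{
    DWORD low = 0, high = 0, done = 0, got, err;
    unsigned char *buf;

    assert(si != NULL && data != NULL && size != NULL);
    *data = NULL;
    *size = 0;
    err = si->lpGetFileSizeFunc(si, &low, &high);
    if (err != 0)
        return err;
    buf = malloc(low ? low : 1);
    if (buf == NULL)
        return 8;                    /* value of ERROR_NOT_ENOUGH_MEMORY */
    while (done < low) {             /* a single call may return fewer bytes */
        got = 0;
        err = si->lpReadFileFunc(si, buf + done, low - done, &got);
        if (err != 0) {
            free(buf);
            return err;
        }
        if (got == 0)
            break;                   /* source ended earlier than reported */
        done += got;
    }
    *data = buf;
    *size = done;
    return 0;
}

/* Test scaffolding: in-memory source that returns at most 4 bytes per read. */
typedef struct { const unsigned char *data; DWORD size, pos; } MEMSOURCE;

DWORD MemGetFileSize(STREAMINFO *si, DWORD *low, DWORD *high)
{
    const MEMSOURCE *src = (const MEMSOURCE *)si->lpApplicationData;
    *low = src->size;
    *high = 0;
    return 0;
}

DWORD MemReadFile(STREAMINFO *si, void *buf, DWORD want, DWORD *got)
{
    MEMSOURCE *src = (MEMSOURCE *)si->lpApplicationData;
    DWORD left = src->size - src->pos;
    DWORD n = want < left ? want : left;
    if (n > 4) n = 4;
    memcpy(buf, src->data + src->pos, n);
    src->pos += n;
    *got = n;
    return 0;
}
```

Loading everything at once is simple but costs memory proportional to the file size; the streaming approach of the samples in the appendices avoids that by converting one 64 KB source block at a time.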

Installing Text Filters

Once a filter is ready, simply place its DLL into the "Filters" subfolder of the SSScanner root folder, then restart SSScanner. Then select the Text Filters frame, in which the new filter should appear (see Pic. 2 above). If the Action field of the filter does not indicate Apply, double-click it to activate the filter.

Appendix A: Sample Source Code of RTF Filter in C

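#include <windows.h>
#include <ctype.h>      // isalpha, isdigit
#include <stdlib.h>     // malloc, free, atol
#include <string.h>     // strcmp

#define MAX_BUFFER_SIZE (1024 * 64)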

typedef enum { rdsNorm, rdsSkip } RDS;              // Rtf Destination State
typedef enum { risNorm, risBin, risHex } RIS;       // Rtf Internal State

typedef struct save             // property save structure
{
    struct save *pNext;         // next save
    RDS rds;
    RIS ris;
} SAVE;

typedef enum {ipfnBin, ipfnHex, ipfnSkipDest } IPFN;
typedef enum {idestPict, idestSkip } IDEST;
typedef enum {kwdChar, kwdDest, kwdProp, kwdSpec} KWD;

typedef struct symbol
{
    char szKeyword[24];
    KWD  kwd;
    int  idx;
} SYM;

typedef enum {rstNorm, rstCtrl, rstVal} RSTATE; // parsing mode: normal, control word,
                                                //   parameter value
// Filter declarations
// The structure type is declared first so that the callback prototypes can refer to it.
typedef struct tagFILTERINFO FILTERINFO, *LPFILTERINFO, STREAMINFO, *LPSTREAMINFO;
typedef DWORD (WINAPI *LPGETFILESIZEFUNC)(LPFILTERINFO, LPDWORD, LPDWORD);
typedef DWORD (WINAPI *LPSETFILEPOSFUNC)(LPFILTERINFO, LPLONG, LPLONG, DWORD);
typedef DWORD (WINAPI *LPREADFILEFUNC)(LPFILTERINFO, LPVOID, DWORD, LPDWORD);
struct tagFILTERINFO {
  DWORD cbStructureSize;
  LPVOID lpApplicationData;
  LPVOID lpFilterData;
  LPGETFILESIZEFUNC lpGetFileSizeFunc;
  LPSETFILEPOSFUNC lpSetFilePosFunc;
  LPREADFILEFUNC lpReadFileFunc;
};

typedef struct {
  int nFilledFrom;
  int nFilledTill;
  RSTATE rstate;
  int fSkipDestIfUnk;
  long cbBin;
  long lParam;
  RDS rds;
  RIS ris;
  SAVE *psave;
  int nOverflowCount;
  int cNibble;
  int nHexValue;
  int fNeg;
  int nCharCount;
  char *lpszDestBuffer;
  LPDWORD lpdwDestCounter;
  char szKeyword[30];
  char szParameter[20];
  char buffer[MAX_BUFFER_SIZE];
} FILTERDATA, *LPFILTERDATA;

// RTF parser declarations

void PushRtfState(LPFILTERDATA lpfd);
void PopRtfState(LPFILTERDATA lpfd);
void ParseChar(char c, LPFILTERDATA lpfd);
void TranslateKeyword(LPFILTERDATA lpfd);
void ParseSpecialKeyword(IPFN ipfn, LPFILTERDATA lpfd);

// Static const data:
// RTF parser tables - Keyword descriptions
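// The table must stay sorted by keyword (TranslateKeyword uses a binary search),
// so its size is taken from the initializer to avoid trailing empty entries.
SYM rgsymRtf[] = {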
    "\x0a",     kwdChar,    0x0d,
    "\x0d",     kwdChar,    0x0d,
    "'",        kwdSpec,    ipfnHex,
    "*",        kwdSpec,    ipfnSkipDest,
    "\\",       kwdChar,    '\\',
    "author",   kwdDest,    idestSkip,
    "bin",      kwdSpec,    ipfnBin,
    "buptim",   kwdDest,    idestSkip,
    "colortbl", kwdDest,    idestSkip,
    "comment",  kwdDest,    idestSkip,
    "creatim",  kwdDest,    idestSkip,
    "doccomm",  kwdDest,    idestSkip,
    "fonttbl",  kwdDest,    idestSkip,
    "footer",   kwdDest,    idestSkip,
    "footerf",  kwdDest,    idestSkip,
    "footerl",  kwdDest,    idestSkip,
    "footerr",  kwdDest,    idestSkip,
    "footnote", kwdDest,    idestSkip,
    "ftncn",    kwdDest,    idestSkip,
    "ftnsep",   kwdDest,    idestSkip,
    "ftnsepc",  kwdDest,    idestSkip,
    "header",   kwdDest,    idestSkip,
    "headerf",  kwdDest,    idestSkip,
    "headerl",  kwdDest,    idestSkip,
    "headerr",  kwdDest,    idestSkip,
    "info",     kwdDest,    idestSkip,
    "keywords", kwdDest,    idestSkip,
    "ldblquote",kwdChar,    '"',
    "line",     kwdChar,    0x0d,
    "operator", kwdDest,    idestSkip,
    "par",      kwdChar,    0x0d,
    "pict",     kwdDest,    idestSkip,
    "printim",  kwdDest,    idestSkip,
    "private1", kwdDest,    idestSkip,
    "rdblquote",kwdChar,    '"',
    "revtim",   kwdDest,    idestSkip,
    "rxe",      kwdDest,    idestSkip,
    "stylesheet",kwdDest,    idestSkip,
    "subject",  kwdDest,    idestSkip,
    "tab",      kwdChar,    0x09,
    "tc",       kwdDest,    idestSkip,
    "title",    kwdDest,    idestSkip,
    "txe",      kwdDest,    idestSkip,
    "xe",       kwdDest,    idestSkip,
    "{",        kwdChar,    '{',
    "}",        kwdChar,    '}'
};

LPCSTR WINAPI GetFilterExtensions() {
  return "RTF\000\000";
}

DWORD WINAPI CreateFilteredStream(LPSTREAMINFO lpsi) {
  DWORD dwResult = ERROR_SUCCESS;
  LPFILTERDATA lpfd;
  lpsi->lpFilterData = NULL;
  if ((lpfd = malloc(sizeof(FILTERDATA))) != NULL) {
    lpsi->lpFilterData = lpfd;
    ZeroMemory(lpfd, sizeof(FILTERDATA));
    lpfd->cNibble = 2;
    lpfd->nFilledTill = -1;
  } else
    dwResult = ERROR_NOT_ENOUGH_MEMORY; // malloc does not set the Win32 last-error code
  return dwResult;
}

void ProcessSourceBuffer(LPFILTERDATA lpfd, DWORD dwBytesToRead) {
  char ch;
  while(lpfd->nFilledFrom <= lpfd->nFilledTill && *lpfd->lpdwDestCounter <= dwBytesToRead){
    ch = lpfd->buffer[lpfd->nFilledFrom];
    switch (lpfd->rstate) {
    case rstNorm:
      if (lpfd->ris == risBin) // if we're parsing binary data, handle it directly
        ParseChar(ch, lpfd);
      else {
        switch (ch) {
        case '{':
          PushRtfState(lpfd);
          break;
        case '}':
          PopRtfState(lpfd);
          break;
        case '\\':
          lpfd->rstate = rstCtrl;
          lpfd->fNeg = FALSE;
          lpfd->szKeyword[0] = '\0';
          lpfd->nCharCount = 0;
          break;
        case 0x0d:
        case 0x0a:          // cr and lf are noise characters...
          break;
        default:
          if (lpfd->ris == risNorm) 
            ParseChar(ch, lpfd);
          else {               // parsing hex data
            lpfd->nHexValue <<= 4;
            if (ch > '9') {
              ch = (ch | 0x20) - ('a' - '9' - 1);
            }
            lpfd->nHexValue += ch - '0';
            lpfd->cNibble--;
            if (!lpfd->cNibble) {
              ParseChar((char)lpfd->nHexValue, lpfd);
              lpfd->cNibble = 2;
              lpfd->nHexValue = 0;
              lpfd->ris = risNorm;
            }
          }                   // end else (ris != risNorm)
          break;
        }       // switch (ch)
      }           // else (ris != risBin)
      break;
    case rstCtrl:
      if (isalpha((unsigned char) ch)) {          // still a keyword.
        lpfd->szKeyword[lpfd->nCharCount] = (char) ch;
        if (lpfd->nCharCount < sizeof(lpfd->szKeyword) - 1)
          lpfd->nCharCount++;
      } else {
        if (lpfd->nCharCount == 0) {  // a control symbol - no delimiter
          lpfd->szKeyword[0] = (char) ch;
          lpfd->szKeyword[1] = '\0';
          TranslateKeyword(lpfd);
          lpfd->rstate = rstNorm;
        } else {
          lpfd->szKeyword[lpfd->nCharCount] = '\0';
          switch (ch) {
          case '\\':
            lpfd->lParam = 0;
            TranslateKeyword(lpfd);
            lpfd->szKeyword[0] = '\0';
            lpfd->nCharCount = 0;
            break;
          case ' ':
            lpfd->lParam = 0;
            TranslateKeyword(lpfd);
            lpfd->rstate = rstNorm;
            break;
          default:
            lpfd->nFilledFrom--;
            lpfd->rstate = rstVal;
            lpfd->szParameter[0] = '\0';
            lpfd->nCharCount = 0;
          }
        }
      }
      break;
    case rstVal:
      if (ch == '-') {
        lpfd->fNeg  = TRUE;
      } else {
        if (isdigit((unsigned char) ch)) {
          lpfd->szParameter[lpfd->nCharCount] = (char) ch;
          if (lpfd->nCharCount < sizeof(lpfd->szParameter) - 1) // bound by szParameter, not szKeyword
            lpfd->nCharCount++;
        } else {
          lpfd->szParameter[lpfd->nCharCount] = '\0';
          lpfd->lParam = atol(lpfd->szParameter);
          if (lpfd->fNeg)
            lpfd->lParam = -lpfd->lParam;
          TranslateKeyword(lpfd);
          if (ch == '\\') {
            lpfd->rstate = rstCtrl;
            lpfd->fNeg = FALSE;
            lpfd->szKeyword[0] = '\0';
            lpfd->nCharCount = 0;
          } else {
            if (ch != ' ') {
              lpfd->nFilledFrom--;
            }
            lpfd->rstate = rstNorm;
          }
        }
      }
      break;
    }    // switch (rstate)
    lpfd->nFilledFrom++;
  }//while(lpfd->nFilledFrom<=lpfd->nFilledTill && *lpfd->lpdwDestCounter<=dwBytesToRead)
}

DWORD WINAPI ReadFilteredStream(LPSTREAMINFO lpsi, LPVOID lpBuffer,
                                DWORD dwBytesToRead, LPDWORD lpdwBytesRead) {
  DWORD dwResult = ERROR_INVALID_PARAMETER;
  LPFILTERDATA lpfd = lpsi->lpFilterData;
  *lpdwBytesRead = 0;
  if (lpfd) {
    dwResult = ERROR_SUCCESS;
    lpfd->lpszDestBuffer = lpBuffer;
    lpfd->lpdwDestCounter = lpdwBytesRead;
    ProcessSourceBuffer(lpfd, dwBytesToRead - 1);
    if (*lpdwBytesRead < dwBytesToRead) { //else we do not have to read more source data
      // here we are when the SourceBuffer is empty but still we are asked for more data
      DWORD dwSourceBytesRead;
      do {
        dwResult = (lpsi->lpReadFileFunc)(lpsi, (LPVOID) lpfd->buffer,
          sizeof(lpfd->buffer), &dwSourceBytesRead);
        if (0 != dwResult) break;
        lpfd->nFilledFrom = 0;
        lpfd->nFilledTill = dwSourceBytesRead - 1;
        ProcessSourceBuffer(lpfd, dwBytesToRead - 1);
      } while (dwSourceBytesRead != 0  && *lpdwBytesRead < dwBytesToRead);        
    }          
  }
  return dwResult;
}

DWORD WINAPI CloseFilteredStream(LPSTREAMINFO lpsi) {
  if (lpsi->lpFilterData) {
    LPFILTERDATA lpfd = lpsi->lpFilterData;
    while (NULL != lpfd->psave)
      PopRtfState(lpfd);
    free(lpsi->lpFilterData);
    lpsi->lpFilterData = NULL;
  }
  return 0;
}

void PushRtfState(LPFILTERDATA lpfd)
{
  SAVE *psaveNew = malloc(sizeof(SAVE));
  if (psaveNew == NULL) {
    lpfd->nOverflowCount++;
    return;
  }
  psaveNew -> pNext = lpfd->psave;
  psaveNew -> rds = lpfd->rds;
  psaveNew -> ris = lpfd->ris;
  lpfd->ris = risNorm;
  lpfd->psave = psaveNew;
  return;
}

void PopRtfState(LPFILTERDATA lpfd)
{
  SAVE *psaveOld;
  
  if (!lpfd->psave)
    return; //ignore unmatched '}'
  if (lpfd->nOverflowCount != 0) {
    lpfd->nOverflowCount--;
    return;
  }

  lpfd->rds = lpfd->psave->rds;
  lpfd->ris = lpfd->psave->ris;
  
  psaveOld = lpfd->psave;
  lpfd->psave = lpfd->psave->pNext;
  free(psaveOld);
}

__inline void ParseChar(char ch, LPFILTERDATA lpfd)
{
  if (lpfd->ris == risBin && --lpfd->cbBin <= 0)
    lpfd->ris = risNorm;
  if (rdsNorm == lpfd->rds) {
    (lpfd->lpszDestBuffer)[(*lpfd->lpdwDestCounter)++] = ch;
  }
  return;
}

void TranslateKeyword(LPFILTERDATA lpfd)
{
  int lb, rb, isym, cmpres;

  // search for szKeyword in rgsymRtf
  lb = 0;
  rb = sizeof(rgsymRtf) / sizeof(SYM) - 1;
  while (lb <= rb) {
    isym = (lb + rb) >> 1;
    cmpres = strcmp(lpfd->szKeyword, rgsymRtf[isym].szKeyword);
    if (cmpres < 0)
      if (isym < rb) rb = isym; else rb--;
    else if (cmpres > 0)
      if (isym > lb) lb = isym; else lb++;
    else
      break;
  }
  if (0 != cmpres) {           // control word not found
    if (lpfd->fSkipDestIfUnk)         // if this is a new destination
      lpfd->rds = rdsSkip;          // skip the destination
    // else just discard it
    lpfd->fSkipDestIfUnk = FALSE;
    return;
  }
  
  // found it!  use kwd and idx to determine what to do with it.
  lpfd->fSkipDestIfUnk = FALSE;
  switch (rgsymRtf[isym].kwd) {
  case kwdChar:
    ParseChar((char)rgsymRtf[isym].idx, lpfd);
    break;
  case kwdDest:
    lpfd->rds = rdsSkip;
    break;
  case kwdSpec:
    ParseSpecialKeyword(rgsymRtf[isym].idx, lpfd);
  }
  return;
}

void ParseSpecialKeyword(IPFN ipfn, LPFILTERDATA lpfd)
{
  if (lpfd->rds == rdsSkip && ipfn != ipfnBin)  // if we're skipping, and it's not
    return;                        // the \bin keyword, ignore it.
  switch (ipfn) {
  case ipfnBin:
    lpfd->ris = risBin;
    lpfd->cbBin = lpfd->lParam;
    break;
  case ipfnSkipDest:
    lpfd->fSkipDestIfUnk = TRUE;
    break;
  case ipfnHex:
    lpfd->ris = risHex;
    break;
  }
  return;
}

Appendix B: Sample Source Code of HTML Filter in Borland Delphi

{$A-,B-,R-,Q-,I-,V-}

Library Htm2Txt;

Uses Windows;

Const
      SupportedExtentions : array [0..10] of char = 'HTM'#0 + 'HTML'#0#0;
      MaxSourceBufferLength = 65536; { 64K }
      CR = #13; // Carriage Return
      Space = ' ';

Type
     StreamInfoPtrType = ^StreamInfoType;

     GetFileSizeFuncType = function(    StreamInfoPtr : StreamInfoPtrType;
                                    var FileSizeLow,
                                        FileSizeHigh  : LongInt) : DWORD; stdcall;

     SetFilePosFuncType = function(    StreamInfoPtr : StreamInfoPtrType;
                                   var OffsetLow,
                                       OffsetHigh    : LongInt;
                                       MoveMethod    : DWORD) : DWORD; stdcall;

     ReadFileFuncType = function(    StreamInfoPtr : StreamInfoPtrType;
                                 var Buffer;
                                     BytesToRead   : LongInt;
                                 var BytesRead     : LongInt) : DWORD; stdcall;

     StreamInfoType = record
      cbStructureSize   : DWORD;
      lpApplicationData : Pointer;
      lpFilterData      : Pointer;
      GetFileSizeFunc   : GetFileSizeFuncType;
      SetFilePosFunc    : SetFilePosFuncType;
      ReadFileFunc      : ReadFileFuncType;
     end;

Type
     StatusType = (TextMode,
                   AboutHTMLTag,
                   InsideHTMLTag,
                   SpecialSymbolMode,
                   LongComment,
                   AboutLongComment1,
                   AboutLongComment2,
                   EndLongComment1,
                   EndLongComment2);

     InternalBufferType = array[0..MaxSourceBufferLength-1] of char;

     FilterDataType = record
      SourceBuffer           : InternalBufferType;
      FilledFrom, FilledTill : Integer;
      Status                 : StatusType;
      LastChar               : Char;
      HeadIsNotPassedYet,
      OutsideHead            : Boolean;
      SpecialString,
      HTMLTag                : ShortString
     end;
     FilterDataPtrType = ^FilterDataType;
     ZeroBasedArrayOfChar = array[0..0] of char;
     ZeroBasedArrayOfCharPtr = ^ZeroBasedArrayOfChar;

function GetFilterExtensions : PChar; export; stdcall;
begin { GetFilterExtensions }
 GetFilterExtensions := @SupportedExtentions[0]
end;  { GetFilterExtensions }

function CreateFilteredStream(var StreamInfo : StreamInfoType) : DWORD; export; stdcall;
begin { CreateFilteredStream }
 StreamInfo.lpFilterData := Nil;
 try
  GetMem(StreamInfo.lpFilterData, SizeOf(FilterDataType));
  with FilterDataPtrType(StreamInfo.lpFilterData)^ do begin
   FilledFrom := 0;
   FilledTill := -1;
   Status := TextMode;
   SpecialString := '';
   HTMLTag       := '';
   LastChar      := Space;
   HeadIsNotPassedYet := True;
   OutsideHead        := True;
  end;
  CreateFilteredStream := 0
 except
  CreateFilteredStream := ERROR_OUTOFMEMORY
 end
end;  { CreateFilteredStream }

procedure ProcessSourceBuffer(var FilterData   : FilterDataType;
                              var DestBuffer   : ZeroBasedArrayOfChar;
                              var DestFillFrom : Integer;
                                  DestFillTill : Integer);

Type SymbolsConversionType = record
                              HTMLReprtesentation : ShortString;
                              SSSReprtesentation  : char;
                             end;

Const TotalSpecialSymbols = 69;
      StandardSpecialString : array[1..TotalSpecialSymbols] of SymbolsConversionType =
      (
      (HTMLReprtesentation: 'quot';     SSSReprtesentation: '"'),
      (HTMLReprtesentation: 'amp';      SSSReprtesentation: '&'),
      (HTMLReprtesentation: 'lt';       SSSReprtesentation: '<'),
      (HTMLReprtesentation: 'gt';       SSSReprtesentation: '>'),
      (HTMLReprtesentation: 'nbsp';     SSSReprtesentation: ' '),
      (HTMLReprtesentation: 'copy';     SSSReprtesentation: #169),
      (HTMLReprtesentation: 'shy';      SSSReprtesentation: #173),
      (HTMLReprtesentation: 'Agrave';   SSSReprtesentation: #192),
      (HTMLReprtesentation: 'Aacute';   SSSReprtesentation: #193),
      (HTMLReprtesentation: 'Acirc';    SSSReprtesentation: #194),
      (HTMLReprtesentation: 'Atilde';   SSSReprtesentation: #195),
      (HTMLReprtesentation: 'Auml';     SSSReprtesentation: #196),
      (HTMLReprtesentation: 'Aring';    SSSReprtesentation: #197),
      (HTMLReprtesentation: 'AElig';    SSSReprtesentation: #198),
      (HTMLReprtesentation: 'Ccedil';   SSSReprtesentation: #199),
      (HTMLReprtesentation: 'Egrave';   SSSReprtesentation: #200),
      (HTMLReprtesentation: 'Eacute';   SSSReprtesentation: #201),
      (HTMLReprtesentation: 'Ecirc';    SSSReprtesentation: #202),
      (HTMLReprtesentation: 'Euml';     SSSReprtesentation: #203),
      (HTMLReprtesentation: 'Igrave';   SSSReprtesentation: #204),
      (HTMLReprtesentation: 'Iacute';   SSSReprtesentation: #205),
      (HTMLReprtesentation: 'Icirc';    SSSReprtesentation: #206),
      (HTMLReprtesentation: 'Iuml';     SSSReprtesentation: #207),
      (HTMLReprtesentation: 'ETH';      SSSReprtesentation: #208),
      (HTMLReprtesentation: 'Ntilde';   SSSReprtesentation: #209),
      (HTMLReprtesentation: 'Ograve';   SSSReprtesentation: #210),
      (HTMLReprtesentation: 'Oacute';   SSSReprtesentation: #211),
      (HTMLReprtesentation: 'Ocirc';    SSSReprtesentation: #212),
      (HTMLReprtesentation: 'Otilde';   SSSReprtesentation: #213),
      (HTMLReprtesentation: 'Ouml';     SSSReprtesentation: #214),
      (HTMLReprtesentation: 'Oslash';   SSSReprtesentation: #216),
      (HTMLReprtesentation: 'Ugrave';   SSSReprtesentation: #217),
      (HTMLReprtesentation: 'Uacute';   SSSReprtesentation: #218),
      (HTMLReprtesentation: 'Ucirc';    SSSReprtesentation: #219),
      (HTMLReprtesentation: 'Uuml';     SSSReprtesentation: #220),
      (HTMLReprtesentation: 'Yacute';   SSSReprtesentation: #221),
      (HTMLReprtesentation: 'THORN';    SSSReprtesentation: #222),
      (HTMLReprtesentation: 'szlig';    SSSReprtesentation: #223),
      (HTMLReprtesentation: 'agrave';   SSSReprtesentation: #224),
      (HTMLReprtesentation: 'aacute';   SSSReprtesentation: #225),
      (HTMLReprtesentation: 'acirc';    SSSReprtesentation: #226),
      (HTMLReprtesentation: 'atilde';   SSSReprtesentation: #227),
      (HTMLReprtesentation: 'auml';     SSSReprtesentation: #228),
      (HTMLReprtesentation: 'aring';    SSSReprtesentation: #229),
      (HTMLReprtesentation: 'aelig';    SSSReprtesentation: #230),
      (HTMLReprtesentation: 'ccedil';   SSSReprtesentation: #231),
      (HTMLReprtesentation: 'egrave';   SSSReprtesentation: #232),
      (HTMLReprtesentation: 'eacute';   SSSReprtesentation: #233),
      (HTMLReprtesentation: 'ecirc';    SSSReprtesentation: #234),
      (HTMLReprtesentation: 'euml';     SSSReprtesentation: #235),
      (HTMLReprtesentation: 'igrave';   SSSReprtesentation: #236),
      (HTMLReprtesentation: 'iacute';   SSSReprtesentation: #237),
      (HTMLReprtesentation: 'icirc';    SSSReprtesentation: #238),
      (HTMLReprtesentation: 'iuml';     SSSReprtesentation: #239),
      (HTMLReprtesentation: 'eth';      SSSReprtesentation: #240),
      (HTMLReprtesentation: 'ntilde';   SSSReprtesentation: #241),
      (HTMLReprtesentation: 'ograve';   SSSReprtesentation: #242),
      (HTMLReprtesentation: 'oacute';   SSSReprtesentation: #243),
      (HTMLReprtesentation: 'ocirc';    SSSReprtesentation: #244),
      (HTMLReprtesentation: 'otilde';   SSSReprtesentation: #245),
      (HTMLReprtesentation: 'ouml';     SSSReprtesentation: #246),
      (HTMLReprtesentation: 'oslash';   SSSReprtesentation: #248),
      (HTMLReprtesentation: 'ugrave';   SSSReprtesentation: #249),
      (HTMLReprtesentation: 'uacute';   SSSReprtesentation: #250),
      (HTMLReprtesentation: 'ucirc';    SSSReprtesentation: #251),
      (HTMLReprtesentation: 'uuml';     SSSReprtesentation: #252),
      (HTMLReprtesentation: 'yacute';   SSSReprtesentation: #253),
      (HTMLReprtesentation: 'thorn';    SSSReprtesentation: #254),
      (HTMLReprtesentation: 'yuml';     SSSReprtesentation: #255)
      );

      TotalLineBreakingTags = 37;
      LineBreakingTag : array[1..TotalLineBreakingTags] of ShortString =
      ('P',      '/P',       'BR',        'CENTER', '/CENTER',
       'HR',     'LI',       'TABLE',     '/TABLE', 'DIR',
       '/DIR' ,  'DIV',      '/DIV',      'FORM',   '/FORM',
       'FRAME',  'FRAMESET', '/FRAMESET', 'H1',     'H2',
       'H3',     'H4',       'H5',        'H6',     'H7',
       '/H1',    '/H2',      '/H3',       '/H4',    '/H5',
       '/H6',    '/H7',      'OL',        '/OL',    'TR',
       'UL',     '/UL');

var C     : Char;
    Found : Boolean;
    i, j  : Integer;

begin
 with FilterData do begin
  while (FilledFrom <= FilledTill) and (DestFillFrom <= DestFillTill) do begin
   C := SourceBuffer[FilledFrom];
   case Status of
    TextMode : case C of
     '<'  : Status := AboutHTMLTag;
     '&'  : begin
             Status := SpecialSymbolMode;
             SpecialString := ''
            end;
     else if OutsideHead and ((LastChar > Space) or (C > Space)) then begin
      DestBuffer[DestFillFrom] := C;
      Inc(DestFillFrom);
      LastChar := C
     end
    end;
    AboutHTMLTag : case C of
     '!'     : Status := AboutLongComment1;
     '>'     : Status := TextMode;
     else      begin
      HTMLTag[0] := #1;
      HTMLTag[1] := UpCase(C);
      Status := InsideHTMLTag
     end
    end;
    AboutLongComment1 : case C of
     '-'  : Status := AboutLongComment2;
     '>'  : Status := TextMode;
     else   begin
      HTMLTag[0] := #2;
      HTMLTag[1] := '!';
      HTMLTag[2] := UpCase(C);
      Status := InsideHTMLTag
     end
    end;
    AboutLongComment2 : case C of
     '-'  : Status := LongComment;
     '>'  : Status := TextMode;
     else   begin
      HTMLTag[0] := #3;
      HTMLTag[1] := '!';
      HTMLTag[2] := '-';
      HTMLTag[3] := UpCase(C);
      Status := InsideHTMLTag
     end
    end;
    EndLongComment1 : case C of
     '-'  : Status := EndLongComment2;
     else   Status := LongComment;
    end;
    EndLongComment2 : case C of
     '-'  : { nothing };
     '>'  : Status := TextMode;
     else   Status := LongComment;
    end;
    InsideHTMLTag : if C = '>' then begin
     Status := TextMode; // always
     if HeadIsNotPassedYet then begin
      if HTMLTag = 'HEAD' then begin
       OutsideHead := False;
       Inc(FilledFrom);
       Continue
      end;

      if (HTMLTag = '/HEAD')
             or
         (
          (Pos('BODY', HTMLTag) = 1) and
          ((HTMLTag = 'BODY') or (HTMLTag[5] <= Space))
         )
      then begin
       OutsideHead := True;
       HeadIsNotPassedYet := False;
       Inc(FilledFrom);
       Continue
      end;
      if HTMLTag = 'TITLE'
      then OutsideHead := True else
      if HTMLTag = '/TITLE' then begin
       OutsideHead := False;
       DestBuffer[DestFillFrom] := CR;
       Inc(DestFillFrom);
       LastChar := CR;
       Inc(FilledFrom);
       Continue
      end;
     end { HeadIsNotPassedYet };

     if OutsideHead then begin
      Found := False;
      for i := 1 to TotalLineBreakingTags do
      if (LineBreakingTag[i][1] = HTMLTag[1]) and
         (Pos(LineBreakingTag[i], HTMLTag) = 1) and
         (
          (LineBreakingTag[i][0] = HTMLTag[0]) or
          (HTMLTag[Byte(LineBreakingTag[i][0]) + 1] <= Space)
         )
      then begin
       DestBuffer[DestFillFrom] := CR;
       Inc(DestFillFrom);
       LastChar := CR; // no matter if it is CR or LF
       Found := True;
       Break
      end;

      if not Found and (LastChar > Space) then begin
       DestBuffer[DestFillFrom] := Space;
       Inc(DestFillFrom);
       LastChar := Space
      end
     end { OutsideHead }
    end else HTMLTag := HTMLTag + UpCase(C);

    SpecialSymbolMode : if C in [#0..Space, ';', '<', '&', '>'] then begin
     case C of
      #0..Space, ';', '>' : Status := TextMode;
      '<'                 : Status := AboutHTMLTag;
      '&'                 : { Status stays SpecialSymbolMode; SpecialString is reset after the lookup below };
     end;
     if OutsideHead then begin
      Found := False;
      if (Length(SpecialString) > 0) and (SpecialString[1] = '#') then begin
       Delete(SpecialString, 1, 1);
       Val(SpecialString, i, j);
       if j = 0 then begin
        DestBuffer[DestFillFrom] := Char(i);
        Inc(DestFillFrom);
        LastChar := Char(i);
        Found := True
       end
      end;

      if Not Found then
      for i := 1 to TotalSpecialSymbols do
      if SpecialString = StandardSpecialString[i].HTMLReprtesentation
      then begin
       Found := True;
       DestBuffer[DestFillFrom] := StandardSpecialString[i].SSSReprtesentation;
       Inc(DestFillFrom);
       LastChar := StandardSpecialString[i].SSSReprtesentation;
       Break
      end;
      if Not Found then begin
       DestBuffer[DestFillFrom] := Space;
       Inc(DestFillFrom);
       LastChar := Space
      end
     end; { OutsideHead }
     if C = '&' then SpecialString := '' { start collecting the next entity }
    end else SpecialString := SpecialString + C;
    LongComment   : if C = '-' then Status := EndLongComment1;
   end { case Status };
   Inc(FilledFrom)
  end { while (FilledFrom <= FilledTill) and (DestFillFrom <= DestFillTill) }
 end { with FilterData }
end;

function ReadFilteredStream(var StreamInfo : StreamInfoType;
                            var Buffer;
                                BufferSize : LongInt;
                            var BytesRead  : LongInt) : DWORD; stdcall;

var SourceBytesRead : LongInt;
    FilterDataPtr   : FilterDataPtrType;

begin { ReadFilteredStream }
 BytesRead := 0;
 FilterDataPtr := StreamInfo.lpFilterData;
 if FilterDataPtr = Nil then begin
  Result := ERROR_INVALID_PARAMETER;
  Exit;
 end;

 Result := 0;
 { processing the contents of SourceBuffer left during one of previous calls }
 ProcessSourceBuffer(FilterDataPtr^,
                     ZeroBasedArrayOfCharPtr(@Buffer)^,
                     BytesRead,
                     BufferSize-1);
 if BytesRead >= BufferSize then { we do not have to read more source data, so: } Exit;

 { here we are when the SourceBuffer is empty but still we are asked for more data }
 repeat
  Result := StreamInfo.ReadFileFunc(@StreamInfo,
                                    FilterDataPtr^.SourceBuffer,
                                    MaxSourceBufferLength,
                                    SourceBytesRead);
  if Result <> 0 then { error reading data, so: } Exit;
  with FilterDataPtr^ do begin
   FilledFrom := 0;
   FilledTill := SourceBytesRead - 1
  end;
  ProcessSourceBuffer(FilterDataPtr^,
                      ZeroBasedArrayOfCharPtr(@Buffer)^,
                      BytesRead,
                      BufferSize-1)
 until (SourceBytesRead = 0) or (BytesRead >= BufferSize);
end;  { ReadFilteredStream }

function CloseFilteredStream(var StreamInfo : StreamInfoType) : DWORD; export; stdcall;
begin { CloseFilteredStream }
 Result := ERROR_INVALID_PARAMETER;
 if StreamInfo.lpFilterData <> Nil then begin
  try
   FreeMem(StreamInfo.lpFilterData, SizeOf(FilterDataType));
   StreamInfo.lpFilterData := Nil;
   Result := 0
  except
  end
 end
end;  { CloseFilteredStream }

Exports GetFilterExtensions  index 1,
        CreateFilteredStream index 2,
        ReadFilteredStream   index 3,
        CloseFilteredStream  index 4;

Begin
End.
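
The numeric branch of SpecialSymbolMode in the listing above (drop the leading '#', then call Val) can be tried in isolation. The console program below is a minimal sketch and is not part of KT Text Filters: DecodeNumericEntity is a hypothetical helper name, and the 0..255 range check is an added assumption that the listing itself does not make.

```pascal
program EntityDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

{ Hypothetical helper (not a KT Text Filters function).
  Entity is the text collected between '&' and the terminator, e.g. '#65'. }
function DecodeNumericEntity(Entity : ShortString; var Ch : Char) : Boolean;
var Value, ErrorPos : Integer;
begin
 Result := False;
 if (Length(Entity) > 1) and (Entity[1] = '#') then begin
  Delete(Entity, 1, 1);             { drop the leading '#' }
  Val(Entity, Value, ErrorPos);     { ErrorPos = 0 means the whole string was a number }
  { Assumption: reject codes that do not fit into a single-byte character }
  if (ErrorPos = 0) and (Value >= 0) and (Value <= 255) then begin
   Ch := Char(Value);
   Result := True
  end
 end
end;

var C : Char;

begin
 { '#65' is the numeric reference for the letter A }
 if DecodeNumericEntity('#65', C) then WriteLn('decoded: ', C);

 { 'abc' is not a number, so the caller falls back to the named-entity table }
 if not DecodeNumericEntity('#abc', C) then WriteLn('not a numeric reference');
end.
```

When the helper returns False, the filter's next step corresponds to the StandardSpecialString lookup, and a Space is written if that lookup fails too.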

Copyright © 1997-2010 Kryloff Technologies, Inc. All Rights Reserved