KT Text Filters
1. Introduction

This document explains how KT Text Filters are used by, and communicate with, Subject Search Scanner™ (SSScanner), one of the Kryloff Technologies software products. The other Kryloff products also use filters, and they communicate with KT Text Filters in approximately the same way.

Subject Search Scanner™ was developed to search through files of different types, such as Text Files (TXT), HTML Files (HTM), Rich Text Files (RTF) and others. These file types store their data in different formats: along with textual data, some of them may contain images or sounds, or even encrypted data. Because the SSScanner search engine can process textual data only, Kryloff Technologies ships it (and the rest of its products) with a number of KT Text Filters. The primary purpose of a Text Filter is to extract the textual data stored in a source file and pass it on to SSScanner (that is, to the calling application, process or thread). On receiving data from the filter, SSScanner analyses it, then selects and reports the most relevant quotations. From SSScanner's point of view, the search therefore always reads plain text, regardless of the internal structure of the source file. The actual data flow is shown below:

[Pic. 1. Data Flow in SSScanner]

Because companies and individuals keep their data in many different formats, you may find that Kryloff Technologies has not included the Text Filter you need in the original shipment, and that SSScanner does not process some of your files correctly. Even in this case SSScanner remains usable, because it can be configured to search through files of any type.

We recommend that you first look carefully at the list of filters shown on the Text Filters page of SSScanner:

[Pic. 2. Text Filters in SSScanner]

If the required text filter is in the list, make sure it is enabled (the Action column indicates "Apply").
If files of your type are not processed by any of the filters, visit the SSScanner product page to check whether the filter you need has been developed by Kryloff Technologies or a third party and is available as a free download. Once you have obtained a filter, place its file(s) into the "Filters" subfolder of the SSScanner root folder, then restart SSScanner. The filter should then appear in the list; to finish adding it, make sure it is enabled. SSScanner keeps this setting for subsequent sessions.

Finally, you may develop a text filter yourself and use it as a plug-in for SSScanner. The next chapter describes how to program text filters.

If you still need a new filter, or you have one of your own that you would like to share with other users of SSScanner, contact us.

2. Programming and Installing Text Filters

A text filter is a Dynamic Link Library (DLL) that exports the following four functions:

GetFilterExtensions
CreateFilteredStream
ReadFilteredStream
CloseFilteredStream
SSScanner uses text filters to read data in the following four cases:
In the last three cases SSScanner loads the filter, calls CreateFilteredStream, then repeatedly calls ReadFilteredStream until the filtered data is exhausted or SSScanner needs no more data. Finally, SSScanner calls CloseFilteredStream to indicate that it has finished reading data, after which it may unload the filter from memory.

To give filters access to the source file, SSScanner provides three callback functions, whose addresses are passed to the filter as parameters. The callback functions are:

GetFileSizeFunc
SetFilePosFunc
ReadFileFunc
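The call sequence described above (CreateFilteredStream, repeated ReadFilteredStream, then CloseFilteredStream) can be illustrated with a small host-side sketch. It is not SSScanner's code: the Win32 types are replaced by portable stand-ins so that it builds without windows.h, only one of the three callbacks is modelled, the "filter" is a trivial pass-through, and MEMFILE, HostReadFile and RunFilter are hypothetical names introduced for the illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-ins for the Win32 types (assumption, see lead-in). */
typedef uint32_t DWORD;
typedef void *LPVOID;
typedef DWORD *LPDWORD;

typedef struct STREAMINFO STREAMINFO, *LPSTREAMINFO;
typedef DWORD (*LPREADFILEFUNC)(LPSTREAMINFO, LPVOID, DWORD, LPDWORD);

struct STREAMINFO {
    DWORD cbStructureSize;
    LPVOID lpApplicationData;      /* owned by the host   */
    LPVOID lpFilterData;           /* owned by the filter */
    LPREADFILEFUNC lpReadFileFunc; /* host callback       */
};

/* Hypothetical host state: the "source file" is a block of memory. */
typedef struct { const char *data; DWORD size, pos; } MEMFILE;

/* Host callback: hands the next bytes of the source file to the filter. */
static DWORD HostReadFile(LPSTREAMINFO si, LPVOID buf, DWORD want, LPDWORD got) {
    MEMFILE *f = si->lpApplicationData;
    DWORD left = f->size - f->pos;
    DWORD n = want < left ? want : left;
    memcpy(buf, f->data + f->pos, n);
    f->pos += n;
    *got = n;
    return 0; /* 0 means success */
}

/* Pass-through "filter": forwards source bytes unchanged. */
static DWORD CreateFilteredStream(LPSTREAMINFO si) {
    si->lpFilterData = NULL; /* a real filter allocates its state here */
    return 0;
}
static DWORD ReadFilteredStream(LPSTREAMINFO si, LPVOID buf, DWORD want, LPDWORD got) {
    return si->lpReadFileFunc(si, buf, want, got);
}
static DWORD CloseFilteredStream(LPSTREAMINFO si) {
    si->lpFilterData = NULL; /* a real filter frees its state here */
    return 0;
}

/* Host side: create the stream, read until the data ends or the output
   buffer cannot take another chunk, then close the stream.
   Returns the number of bytes collected, or -1 on error. */
static long RunFilter(const char *src, char *out, DWORD outSize) {
    assert(src != NULL && out != NULL);
    MEMFILE file = { src, (DWORD)strlen(src), 0 };
    STREAMINFO si = { sizeof si, &file, NULL, HostReadFile };
    char chunk[8];
    DWORD got = 0, total = 0, rc;

    if (CreateFilteredStream(&si) != 0)
        return -1;
    while ((rc = ReadFilteredStream(&si, chunk, sizeof chunk, &got)) == 0
           && got != 0 && total + got <= outSize) {
        memcpy(out + total, chunk, got);
        total += got;
    }
    CloseFilteredStream(&si); /* always paired with a successful create */
    return rc == 0 ? (long)total : -1;
}
```

Calling RunFilter with a source string and an output buffer copies the text through the pass-through filter in 8-byte chunks. A real filter would keep its parser state in lpFilterData between calls, as the appendix samples do.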
Function prototypes and descriptions

LPCSTR WINAPI GetFilterExtensions(void);
DWORD WINAPI CreateFilteredStream(LPSTREAMINFO lpStreamInfo);
DWORD WINAPI ReadFilteredStream(LPSTREAMINFO lpStreamInfo, LPVOID lpBuffer,
                                DWORD dwBytesToRead, LPDWORD lpdwBytesRead);
DWORD WINAPI CloseFilteredStream(LPSTREAMINFO lpStreamInfo);

Callback Functions

DWORD WINAPI GetFileSizeFunc(LPFILTERINFO lpFilterInfo, LPDWORD lpFileSizeLow,
                             LPDWORD lpFileSizeHigh);
DWORD WINAPI SetFilePosFunc(LPFILTERINFO lpFilterInfo, LPLONG lpDistanceToMoveLow,
                            LPLONG lpDistanceToMoveHigh, DWORD dwMoveMethod);
DWORD WINAPI ReadFileFunc(LPFILTERINFO lpFilterInfo, LPVOID lpBuffer,
                          DWORD dwBytesToRead, LPDWORD lpdwBytesRead);

Installing Text Filters

Once a filter is ready, place the corresponding DLL into the "Filters" subfolder of the SSScanner root folder, then restart SSScanner. Next, select the Text Filters frame, in which the new filter should appear (see Pic. 2 above). If the Action field of this filter does not indicate Apply, double-click it to activate the filter.

Appendix A: Sample Source Code of RTF Filter in C
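#include <windows.h>
#include <stdlib.h> // malloc, free, atol
#include <string.h> // strcmp
#include <ctype.h> // isalpha, isdigit
#define MAX_BUFFER_SIZE (1024 * 64) // parenthesized so the macro expands safely inside expressions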
typedef enum { rdsNorm, rdsSkip } RDS; // Rtf Destination State
typedef enum { risNorm, risBin, risHex } RIS; // Rtf Internal State
typedef struct save // property save structure
{
struct save *pNext; // next save
RDS rds;
RIS ris;
} SAVE;
typedef enum {ipfnBin, ipfnHex, ipfnSkipDest } IPFN;
typedef enum {idestPict, idestSkip } IDEST;
typedef enum {kwdChar, kwdDest, kwdProp, kwdSpec} KWD;
typedef struct symbol
{
char szKeyword[24];
KWD kwd;
int idx;
} SYM;
typedef enum {rstNorm, rstCtrl, rstVal} RSTATE; // parsing mode: normal, control word,
// parameter value
// Filter declarations
// Forward declaration: the callback types below take a pointer to this structure,
// so its names must be known before the structure body is defined.
typedef struct tagFILTERINFO FILTERINFO, *LPFILTERINFO, STREAMINFO, *LPSTREAMINFO;
typedef DWORD (WINAPI *LPGETFILESIZEFUNC)(LPFILTERINFO, LPDWORD, LPDWORD);
typedef DWORD (WINAPI *LPSETFILEPOSFUNC)(LPFILTERINFO, LPLONG, LPLONG, DWORD);
typedef DWORD (WINAPI *LPREADFILEFUNC)(LPFILTERINFO, LPVOID, DWORD, LPDWORD);
struct tagFILTERINFO {
DWORD cbStructureSize;
LPVOID lpApplicationData;
LPVOID lpFilterData;
LPGETFILESIZEFUNC lpGetFileSizeFunc;
LPSETFILEPOSFUNC lpSetFilePosFunc;
LPREADFILEFUNC lpReadFileFunc;
};
typedef struct {
int nFilledFrom;
int nFilledTill;
RSTATE rstate;
int fSkipDestIfUnk;
long cbBin;
long lParam;
RDS rds;
RIS ris;
SAVE *psave;
int nOverflowCount;
int cNibble;
int nHexValue;
int fNeg;
int nCharCount;
char *lpszDestBuffer;
LPDWORD lpdwDestCounter;
char szKeyword[30];
char szParameter[20];
char buffer[MAX_BUFFER_SIZE];
} FILTERDATA, *LPFILTERDATA;
// RTF parser declarations
void PushRtfState(LPFILTERDATA lpfd);
void PopRtfState(LPFILTERDATA lpfd);
void ParseChar(char c, LPFILTERDATA lpfd);
void TranslateKeyword(LPFILTERDATA lpfd);
void ParseSpecialKeyword(IPFN ipfn, LPFILTERDATA lpfd);
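// Static const data:
// RTF parser tables - Keyword descriptions
// The table must stay sorted in strcmp() order because TranslateKeyword uses a binary search.
// The array size is deduced from the initializer: a fixed size larger than the number of
// entries would append empty keywords after "}" and break the sort order.
SYM rgsymRtf[] = {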
"\x0a", kwdChar, 0x0d,
"\x0d", kwdChar, 0x0d,
"'", kwdSpec, ipfnHex,
"*", kwdSpec, ipfnSkipDest,
"\\", kwdChar, '\\',
"author", kwdDest, idestSkip,
"bin", kwdSpec, ipfnBin,
"buptim", kwdDest, idestSkip,
"colortbl", kwdDest, idestSkip,
"comment", kwdDest, idestSkip,
"creatim", kwdDest, idestSkip,
"doccomm", kwdDest, idestSkip,
"fonttbl", kwdDest, idestSkip,
"footer", kwdDest, idestSkip,
"footerf", kwdDest, idestSkip,
"footerl", kwdDest, idestSkip,
"footerr", kwdDest, idestSkip,
"footnote", kwdDest, idestSkip,
"ftncn", kwdDest, idestSkip,
"ftnsep", kwdDest, idestSkip,
"ftnsepc", kwdDest, idestSkip,
"header", kwdDest, idestSkip,
"headerf", kwdDest, idestSkip,
"headerl", kwdDest, idestSkip,
"headerr", kwdDest, idestSkip,
"info", kwdDest, idestSkip,
"keywords", kwdDest, idestSkip,
"ldblquote",kwdChar, '"',
"line", kwdChar, 0x0d,
"operator", kwdDest, idestSkip,
"par", kwdChar, 0x0d,
"pict", kwdDest, idestSkip,
"printim", kwdDest, idestSkip,
"private1", kwdDest, idestSkip,
"rdblquote",kwdChar, '"',
"revtim", kwdDest, idestSkip,
"rxe", kwdDest, idestSkip,
"stylesheet",kwdDest, idestSkip,
"subject", kwdDest, idestSkip,
"tab", kwdChar, 0x09,
"tc", kwdDest, idestSkip,
"title", kwdDest, idestSkip,
"txe", kwdDest, idestSkip,
"xe", kwdDest, idestSkip,
"{", kwdChar, '{',
"}", kwdChar, '}'
};
LPCSTR WINAPI GetFilterExtensions() {
return "RTF\000\000";
}
DWORD WINAPI CreateFilteredStream(LPSTREAMINFO lpsi) {
DWORD dwResult = ERROR_SUCCESS;
LPFILTERDATA lpfd;
lpsi->lpFilterData = NULL;
lpfd = malloc(sizeof(FILTERDATA));
if (lpfd != NULL) {
lpsi->lpFilterData = lpfd;
ZeroMemory(lpfd, sizeof(FILTERDATA));
lpfd->cNibble = 2;
lpfd->nFilledTill = -1;
} else
dwResult = ERROR_OUTOFMEMORY; // malloc does not report failure through GetLastError()
return dwResult;
}
void ProcessSourceBuffer(LPFILTERDATA lpfd, DWORD dwBytesToRead) {
char ch;
while(lpfd->nFilledFrom <= lpfd->nFilledTill && *lpfd->lpdwDestCounter <= dwBytesToRead){
ch = lpfd->buffer[lpfd->nFilledFrom];
switch (lpfd->rstate) {
case rstNorm:
if (lpfd->ris == risBin) // if we're parsing binary data, handle it directly
ParseChar(ch, lpfd);
else {
switch (ch) {
case '{':
PushRtfState(lpfd);
break;
case '}':
PopRtfState(lpfd);
break;
case '\\':
lpfd->rstate = rstCtrl;
lpfd->fNeg = FALSE;
lpfd->szKeyword[0] = '\0';
lpfd->nCharCount = 0;
break;
case 0x0d:
case 0x0a: // cr and lf are noise characters...
break;
default:
if (lpfd->ris == risNorm)
ParseChar(ch, lpfd);
else { // parsing hex data
lpfd->nHexValue <<= 4;
if (ch > '9') {
ch = (ch | 0x20) - ('a' - '9' - 1);
}
lpfd->nHexValue += ch - '0';
lpfd->cNibble--;
if (!lpfd->cNibble) {
ParseChar((char)lpfd->nHexValue, lpfd);
lpfd->cNibble = 2;
lpfd->nHexValue = 0;
lpfd->ris = risNorm;
}
} // end else (ris != risNorm)
break;
} // switch (ch)
} // else (ris != risBin)
break;
case rstCtrl:
if (isalpha((unsigned char) ch)) { // still a keyword; the cast avoids undefined behaviour for negative char values
lpfd->szKeyword[lpfd->nCharCount] = (char) ch;
if (lpfd->nCharCount < sizeof(lpfd->szKeyword) - 1)
lpfd->nCharCount++;
} else {
if (lpfd->nCharCount == 0) { // a control symbol - no delimiter
lpfd->szKeyword[0] = (char) ch;
lpfd->szKeyword[1] = '\0';
TranslateKeyword(lpfd);
lpfd->rstate = rstNorm;
} else {
lpfd->szKeyword[lpfd->nCharCount] = '\0';
switch (ch) {
case '\\':
lpfd->lParam = 0;
TranslateKeyword(lpfd);
lpfd->szKeyword[0] = '\0';
lpfd->nCharCount = 0;
break;
case ' ':
lpfd->lParam = 0;
TranslateKeyword(lpfd);
lpfd->rstate = rstNorm;
break;
default:
lpfd->nFilledFrom--;
lpfd->rstate = rstVal;
lpfd->szParameter[0] = '\0';
lpfd->nCharCount = 0;
}
}
}
break;
case rstVal:
if (ch == '-') {
lpfd->fNeg = TRUE;
} else {
if (isdigit((unsigned char) ch)) {
lpfd->szParameter[lpfd->nCharCount] = (char) ch;
if (lpfd->nCharCount < sizeof(lpfd->szParameter) - 1) // bound by szParameter, not the larger szKeyword
lpfd->nCharCount++;
} else {
lpfd->szParameter[lpfd->nCharCount] = '\0';
lpfd->lParam = atol(lpfd->szParameter);
if (lpfd->fNeg)
lpfd->lParam = -lpfd->lParam;
TranslateKeyword(lpfd);
if (ch == '\\') {
lpfd->rstate = rstCtrl;
lpfd->fNeg = FALSE;
lpfd->szKeyword[0] = '\0';
lpfd->nCharCount = 0;
} else {
if (ch != ' ') {
lpfd->nFilledFrom--;
}
lpfd->rstate = rstNorm;
}
}
break;
}
} // switch (rstate)
lpfd->nFilledFrom++;
}//while(lpfd->nFilledFrom<=lpfd->nFilledTill && *lpfd->lpdwDestCounter<=dwBytesToRead)
}
DWORD WINAPI ReadFilteredStream(LPSTREAMINFO lpsi, LPVOID lpBuffer,
DWORD dwBytesToRead, LPDWORD lpdwBytesRead) {
DWORD dwResult = ERROR_INVALID_PARAMETER;
LPFILTERDATA lpfd = lpsi->lpFilterData;
*lpdwBytesRead = 0;
if (lpfd) {
dwResult = ERROR_SUCCESS;
lpfd->lpszDestBuffer = lpBuffer;
lpfd->lpdwDestCounter = lpdwBytesRead;
ProcessSourceBuffer(lpfd, dwBytesToRead - 1);
if (*lpdwBytesRead < dwBytesToRead) { //else we do not have to read more source data
// here we are when the SourceBuffer is empty but still we are asked for more data
DWORD dwSourceBytesRead;
do {
dwResult = (lpsi->lpReadFileFunc)(lpsi, (LPVOID) lpfd->buffer,
sizeof(lpfd->buffer), &dwSourceBytesRead);
if (0 != dwResult) break;
lpfd->nFilledFrom = 0;
lpfd->nFilledTill = dwSourceBytesRead - 1;
ProcessSourceBuffer(lpfd, dwBytesToRead - 1);
} while (dwSourceBytesRead != 0 && *lpdwBytesRead < dwBytesToRead);
}
}
return dwResult;
}
DWORD WINAPI CloseFilteredStream(LPSTREAMINFO lpsi) {
if (lpsi->lpFilterData) {
LPFILTERDATA lpfd = lpsi->lpFilterData;
while (NULL != lpfd->psave)
PopRtfState(lpfd);
free(lpsi->lpFilterData);
lpsi->lpFilterData = NULL;
}
return 0;
}
void PushRtfState(LPFILTERDATA lpfd)
{
SAVE *psaveNew = malloc(sizeof(SAVE));
if (psaveNew == NULL) {
lpfd->nOverflowCount++;
return;
}
psaveNew -> pNext = lpfd->psave;
psaveNew -> rds = lpfd->rds;
psaveNew -> ris = lpfd->ris;
lpfd->ris = risNorm;
lpfd->psave = psaveNew;
return;
}
void PopRtfState(LPFILTERDATA lpfd)
{
SAVE *psaveOld;
if (!lpfd->psave)
return; //ignore unmatched '}'
if (lpfd->nOverflowCount != 0) {
lpfd->nOverflowCount--;
return;
}
lpfd->rds = lpfd->psave->rds;
lpfd->ris = lpfd->psave->ris;
psaveOld = lpfd->psave;
lpfd->psave = lpfd->psave->pNext;
free(psaveOld);
}
__inline void ParseChar(char ch, LPFILTERDATA lpfd)
{
if (lpfd->ris == risBin && --lpfd->cbBin <= 0)
lpfd->ris = risNorm;
if (rdsNorm == lpfd->rds) {
(lpfd->lpszDestBuffer)[(*lpfd->lpdwDestCounter)++] = ch;
}
return;
}
void TranslateKeyword(LPFILTERDATA lpfd)
{
int lb, rb, isym, cmpres;
// search for szKeyword in rgsymRtf
lb = 0;
rb = sizeof(rgsymRtf) / sizeof(SYM) - 1;
while (lb <= rb) {
isym = (lb + rb) >> 1;
cmpres = strcmp(lpfd->szKeyword, rgsymRtf[isym].szKeyword);
if (cmpres < 0)
if (isym < rb) rb = isym; else rb--;
else if (cmpres > 0)
if (isym > lb) lb = isym; else lb++;
else
break;
}
if (0 != cmpres) { // control word not found
if (lpfd->fSkipDestIfUnk) // if this is a new destination
lpfd->rds = rdsSkip; // skip the destination
// else just discard it
lpfd->fSkipDestIfUnk = FALSE;
return;
}
// found it! use kwd and idx to determine what to do with it.
lpfd->fSkipDestIfUnk = FALSE;
switch (rgsymRtf[isym].kwd) {
case kwdChar:
ParseChar((char)rgsymRtf[isym].idx, lpfd);
break;
case kwdDest:
lpfd->rds = rdsSkip;
break;
case kwdSpec:
ParseSpecialKeyword(rgsymRtf[isym].idx, lpfd);
}
return;
}
void ParseSpecialKeyword(IPFN ipfn, LPFILTERDATA lpfd)
{
if (lpfd->rds == rdsSkip && ipfn != ipfnBin) // if we're skipping, and it's not
return; // the \bin keyword, ignore it.
switch (ipfn) {
case ipfnBin:
lpfd->ris = risBin;
lpfd->cbBin = lpfd->lParam;
break;
case ipfnSkipDest:
lpfd->fSkipDestIfUnk = TRUE;
break;
case ipfnHex:
lpfd->ris = risHex;
break;
}
return;
}
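TranslateKeyword above finds a control word by binary search, which only works while the keyword table is sorted in strcmp order. The following sketch shows the same lookup with the standard bsearch function over a short excerpt of the table. The KEYWORD structure, the table excerpt, and the names CompareKeyword and FindKeyword are assumptions made for the illustration; they are not part of the sample filter.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum { kwdChar, kwdDest, kwdSpec } KWD;

/* Simplified table entry (hypothetical; the sample filter uses SYM). */
typedef struct {
    const char *szKeyword;
    KWD kwd;
} KEYWORD;

/* Excerpt of the keyword table. It must stay sorted in strcmp order,
   otherwise bsearch may miss entries that are present. */
static const KEYWORD table[] = {
    { "\\",     kwdChar },
    { "author", kwdDest },
    { "bin",    kwdSpec },
    { "par",    kwdChar },
    { "pict",   kwdDest },
    { "tab",    kwdChar },
    { "{",      kwdChar },
};

/* bsearch comparator: the key is a C string, the element is a KEYWORD. */
static int CompareKeyword(const void *key, const void *elem) {
    return strcmp((const char *)key, ((const KEYWORD *)elem)->szKeyword);
}

/* Returns the matching entry, or NULL for an unknown control word. */
static const KEYWORD *FindKeyword(const char *name) {
    assert(name != NULL);
    return bsearch(name, table, sizeof table / sizeof table[0],
                   sizeof table[0], CompareKeyword);
}
```

With this excerpt, FindKeyword("pict") returns the kwdDest entry and an unlisted word returns NULL, which corresponds to the "control word not found" branch of TranslateKeyword.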
Appendix B: Sample Source Code of HTML Filter in Borland Delphi
{$A-,B-,R-,Q-,I-,V-}
Library Htm2Txt;
Uses Windows;
Const
SupportedExtentions : array [0..10] of char = 'HTM'#0 + 'HTML'#0#0;
MaxSourceBufferLength = 65536; { 64K }
CR = #13; // Carriage Return
Space = ' ';
Type
StreamInfoPtrType = ^StreamInfoType;
GetFileSizeFuncType = function( StreamInfoPtr : StreamInfoPtrType;
var FileSizeLow,
FileSizeHigh : LongInt) : DWORD; stdcall;
SetFilePosFuncType = function( StreamInfoPtr : StreamInfoPtrType;
var OffsetLow,
OffsetHigh : LongInt;
MoveMethod : DWORD) : DWORD; stdcall;
ReadFileFuncType = function( StreamInfoPtr : StreamInfoPtrType;
var Buffer;
BytesToRead : LongInt;
var BytesRead : LongInt) : DWORD; stdcall;
StreamInfoType = record
cbStructureSize : DWORD;
lpApplicationData : Pointer;
lpFilterData : Pointer;
GetFileSizeFunc : GetFileSizeFuncType;
SetFilePosFunc : SetFilePosFuncType;
ReadFileFunc : ReadFileFuncType;
end;
Type
StatusType = (TextMode,
AboutHTMLTag,
InsideHTMLTag,
SpecialSymbolMode,
LongComment,
AboutLongComment1,
AboutLongComment2,
EndLongComment1,
EndLongComment2);
InternalBufferType = array[0..MaxSourceBufferLength-1] of char;
FilterDataType = record
SourceBuffer : InternalBufferType;
FilledFrom, FilledTill : Integer;
Status : StatusType;
LastChar : Char;
HeadIsNotPassedYet,
OutsideHead : Boolean;
SpecialString,
HTMLTag : ShortString
end;
FilterDataPtrType = ^FilterDataType;
ZeroBasedArrayOfChar = array[0..0] of char;
ZeroBasedArrayOfCharPtr = ^ZeroBasedArrayOfChar;
function GetFilterExtensions : PChar; export; stdcall;
begin { GetFilterExtensions }
GetFilterExtensions := @SupportedExtentions[0]
end; { GetFilterExtensions }
function CreateFilteredStream(var StreamInfo : StreamInfoType) : DWORD; export; stdcall;
begin { CreateFilteredStream }
StreamInfo.lpFilterData := Nil;
try
GetMem(StreamInfo.lpFilterData, SizeOf(FilterDataType));
with FilterDataPtrType(StreamInfo.lpFilterData)^ do begin
FilledFrom := 0;
FilledTill := -1;
Status := TextMode;
SpecialString := '';
HTMLTag := '';
LastChar := Space;
HeadIsNotPassedYet := True;
OutsideHead := True;
end;
CreateFilteredStream := 0
except
CreateFilteredStream := ERROR_OUTOFMEMORY
end
end; { CreateFilteredStream }
procedure ProcessSourceBuffer(var FilterData : FilterDataType;
var DestBuffer : ZeroBasedArrayOfChar;
var DestFillFrom : Integer;
DestFillTill : Integer);
Type SymbolsConversionType = record
HTMLReprtesentation : ShortString;
SSSReprtesentation : char;
end;
Const TotalSpecialSymbols = 69;
StandardSpecialString : array[1..TotalSpecialSymbols] of SymbolsConversionType =
(
(HTMLReprtesentation: 'quot'; SSSReprtesentation: '"'),
(HTMLReprtesentation: 'amp'; SSSReprtesentation: '&'),
(HTMLReprtesentation: 'lt'; SSSReprtesentation: '<'),
(HTMLReprtesentation: 'gt'; SSSReprtesentation: '>'),
(HTMLReprtesentation: 'nbsp'; SSSReprtesentation: ' '),
(HTMLReprtesentation: 'copy'; SSSReprtesentation: #169),
(HTMLReprtesentation: 'shy'; SSSReprtesentation: #173),
(HTMLReprtesentation: 'Agrave'; SSSReprtesentation: #192),
(HTMLReprtesentation: 'Aacute'; SSSReprtesentation: #193),
(HTMLReprtesentation: 'Acirc'; SSSReprtesentation: #194),
(HTMLReprtesentation: 'Atilde'; SSSReprtesentation: #195),
(HTMLReprtesentation: 'Auml'; SSSReprtesentation: #196),
(HTMLReprtesentation: 'Aring'; SSSReprtesentation: #197),
(HTMLReprtesentation: 'AElig'; SSSReprtesentation: #198),
(HTMLReprtesentation: 'Ccedil'; SSSReprtesentation: #199),
(HTMLReprtesentation: 'Egrave'; SSSReprtesentation: #200),
(HTMLReprtesentation: 'Eacute'; SSSReprtesentation: #201),
(HTMLReprtesentation: 'Ecirc'; SSSReprtesentation: #202),
(HTMLReprtesentation: 'Euml'; SSSReprtesentation: #203),
(HTMLReprtesentation: 'Igrave'; SSSReprtesentation: #204),
(HTMLReprtesentation: 'Iacute'; SSSReprtesentation: #205),
(HTMLReprtesentation: 'Icirc'; SSSReprtesentation: #206),
(HTMLReprtesentation: 'Iuml'; SSSReprtesentation: #207),
(HTMLReprtesentation: 'ETH'; SSSReprtesentation: #208),
(HTMLReprtesentation: 'Ntilde'; SSSReprtesentation: #209),
(HTMLReprtesentation: 'Ograve'; SSSReprtesentation: #210),
(HTMLReprtesentation: 'Oacute'; SSSReprtesentation: #211),
(HTMLReprtesentation: 'Ocirc'; SSSReprtesentation: #212),
(HTMLReprtesentation: 'Otilde'; SSSReprtesentation: #213),
(HTMLReprtesentation: 'Ouml'; SSSReprtesentation: #214),
(HTMLReprtesentation: 'Oslash'; SSSReprtesentation: #216),
(HTMLReprtesentation: 'Ugrave'; SSSReprtesentation: #217),
(HTMLReprtesentation: 'Uacute'; SSSReprtesentation: #218),
(HTMLReprtesentation: 'Ucirc'; SSSReprtesentation: #219),
(HTMLReprtesentation: 'Uuml'; SSSReprtesentation: #220),
(HTMLReprtesentation: 'Yacute'; SSSReprtesentation: #221),
(HTMLReprtesentation: 'THORN'; SSSReprtesentation: #222),
(HTMLReprtesentation: 'szlig'; SSSReprtesentation: #223),
(HTMLReprtesentation: 'agrave'; SSSReprtesentation: #224),
(HTMLReprtesentation: 'aacute'; SSSReprtesentation: #225),
(HTMLReprtesentation: 'acirc'; SSSReprtesentation: #226),
(HTMLReprtesentation: 'atilde'; SSSReprtesentation: #227),
(HTMLReprtesentation: 'auml'; SSSReprtesentation: #228),
(HTMLReprtesentation: 'aring'; SSSReprtesentation: #229),
(HTMLReprtesentation: 'aelig'; SSSReprtesentation: #230),
(HTMLReprtesentation: 'ccedil'; SSSReprtesentation: #231),
(HTMLReprtesentation: 'egrave'; SSSReprtesentation: #232),
(HTMLReprtesentation: 'eacute'; SSSReprtesentation: #233),
(HTMLReprtesentation: 'ecirc'; SSSReprtesentation: #234),
(HTMLReprtesentation: 'euml'; SSSReprtesentation: #235),
(HTMLReprtesentation: 'igrave'; SSSReprtesentation: #236),
(HTMLReprtesentation: 'iacute'; SSSReprtesentation: #237),
(HTMLReprtesentation: 'icirc'; SSSReprtesentation: #238),
(HTMLReprtesentation: 'iuml'; SSSReprtesentation: #239),
(HTMLReprtesentation: 'eth'; SSSReprtesentation: #240),
(HTMLReprtesentation: 'ntilde'; SSSReprtesentation: #241),
(HTMLReprtesentation: 'ograve'; SSSReprtesentation: #242),
(HTMLReprtesentation: 'oacute'; SSSReprtesentation: #243),
(HTMLReprtesentation: 'ocirc'; SSSReprtesentation: #244),
(HTMLReprtesentation: 'otilde'; SSSReprtesentation: #245),
(HTMLReprtesentation: 'ouml'; SSSReprtesentation: #246),
(HTMLReprtesentation: 'oslash'; SSSReprtesentation: #248),
(HTMLReprtesentation: 'ugrave'; SSSReprtesentation: #249),
(HTMLReprtesentation: 'uacute'; SSSReprtesentation: #250),
(HTMLReprtesentation: 'ucirc'; SSSReprtesentation: #251),
(HTMLReprtesentation: 'uuml'; SSSReprtesentation: #252),
(HTMLReprtesentation: 'yacute'; SSSReprtesentation: #253),
(HTMLReprtesentation: 'thorn'; SSSReprtesentation: #254),
(HTMLReprtesentation: 'yuml'; SSSReprtesentation: #255)
);
TotalLineBreakingTags = 37;
LineBreakingTag : array[1..TotalLineBreakingTags] of ShortString =
('P', '/P', 'BR', 'CENTER', '/CENTER',
'HR', 'LI', 'TABLE', '/TABLE', 'DIR',
'/DIR' , 'DIV', '/DIV', 'FORM', '/FORM',
'FRAME', 'FRAMESET', '/FRAMESET', 'H1', 'H2',
'H3', 'H4', 'H5', 'H6', 'H7',
'/H1', '/H2', '/H3', '/H4', '/H5',
'/H6', '/H7', 'OL', '/OL', 'TR',
'UL', '/UL');
var C : Char;
Found : Boolean;
i, j : Integer;
begin
with FilterData do begin
while (FilledFrom <= FilledTill) and (DestFillFrom <= DestFillTill) do begin
C := SourceBuffer[FilledFrom];
case Status of
TextMode : case C of
'<' : Status := AboutHTMLTag;
'&' : begin
Status := SpecialSymbolMode;
SpecialString := ''
end;
else if OutsideHead and ((LastChar > Space) or (C > Space)) then begin
DestBuffer[DestFillFrom] := C;
Inc(DestFillFrom);
LastChar := C
end
end;
AboutHTMLTag : case C of
'!' : Status := AboutLongComment1;
'>' : Status := TextMode;
else begin
HTMLTag[0] := #1;
HTMLTag[1] := UpCase(C);
Status := InsideHTMLTag
end
end;
AboutLongComment1 : case C of
'-' : Status := AboutLongComment2;
'>' : Status := TextMode;
else begin
HTMLTag[0] := #2;
HTMLTag[1] := '!';
HTMLTag[2] := UpCase(C);
Status := InsideHTMLTag
end
end;
AboutLongComment2 : case C of
'-' : Status := LongComment;
'>' : Status := TextMode;
else begin
HTMLTag[0] := #3;
HTMLTag[1] := '!';
HTMLTag[2] := '-';
HTMLTag[3] := UpCase(C);
Status := InsideHTMLTag
end
end;
EndLongComment1 : case C of
'-' : Status := EndLongComment2;
else Status := LongComment;
end;
EndLongComment2 : case C of
'-' : { nothing };
'>' : Status := TextMode;
else Status := LongComment;
end;
InsideHTMLTag : if C = '>' then begin
Status := TextMode; // always
if HeadIsNotPassedYet then begin
if HTMLTag = 'HEAD' then begin
OutsideHead := False;
Inc(FilledFrom);
Continue
end;
if (HTMLTag = '/HEAD')
or
(
(Pos('BODY', HTMLTag) = 1) and
((HTMLTag = 'BODY') or (HTMLTag[5] <= Space))
)
then begin
OutsideHead := True;
HeadIsNotPassedYet := False;
Inc(FilledFrom);
Continue
end;
if HTMLTag = 'TITLE'
then OutsideHead := True else
if HTMLTag = '/TITLE' then begin
OutsideHead := False;
DestBuffer[DestFillFrom] := CR;
Inc(DestFillFrom);
LastChar := CR;
Inc(FilledFrom);
Continue
end;
end { HeadIsNotPassedYet };
if OutsideHead then begin
Found := False;
for i := 1 to TotalLineBreakingTags do
if (LineBreakingTag[i][1] = HTMLTag[1]) and
(Pos(LineBreakingTag[i], HTMLTag) = 1) and
(
(LineBreakingTag[i][0] = HTMLTag[0]) or
(HTMLTag[Byte(LineBreakingTag[i][0]) + 1] <= Space)
)
then begin
DestBuffer[DestFillFrom] := CR;
Inc(DestFillFrom);
LastChar := CR; // no matter if it is CR or LF
Found := True;
Break
end;
if not Found and (LastChar > Space) then begin
DestBuffer[DestFillFrom] := Space;
Inc(DestFillFrom);
LastChar := Space
end
end { OutsideHead }
end else HTMLTag := HTMLTag + UpCase(C);
SpecialSymbolMode : if C in [#0..Space, ';', '<', '&', '>'] then begin
case C of
#0..Space, ';', '>' : Status := TextMode;
'<' : Status := AboutHTMLTag;
'&' : SpecialString := ''; {Status is already SpecialSymbolMode}
end;
if OutsideHead then begin
Found := False;
if (Length(SpecialString) > 0) and (SpecialString[1] = '#') then begin
Delete(SpecialString, 1, 1);
Val(SpecialString, i, j);
if j = 0 then begin
DestBuffer[DestFillFrom] := Char(i);
Inc(DestFillFrom);
LastChar := Char(i);
Found := True
end
end;
if Not Found then
for i := 1 to TotalSpecialSymbols do
if SpecialString = StandardSpecialString[i].HTMLReprtesentation
then begin
Found := True;
DestBuffer[DestFillFrom] := StandardSpecialString[i].SSSReprtesentation;
Inc(DestFillFrom);
LastChar := StandardSpecialString[i].SSSReprtesentation;
Break
end;
if Not Found then begin
DestBuffer[DestFillFrom] := Space;
Inc(DestFillFrom);
LastChar := Space
end
end { OutsideHead }
end else SpecialString := SpecialString + C;
LongComment : if C = '-' then Status := EndLongComment1;
end { case Status };
Inc(FilledFrom)
end { while (FilledFrom <= FilledTill) and (DestFillFrom <= DestFillTill) }
end { with FilterData }
end;
function ReadFilteredStream(var StreamInfo : StreamInfoType;
var Buffer;
BufferSize : LongInt;
var BytesRead : LongInt) : DWORD; stdcall;
var SourceBytesRead : LongInt;
FilterDataPtr : FilterDataPtrType;
begin { ReadFilteredStream }
BytesRead := 0;
FilterDataPtr := StreamInfo.lpFilterData;
if FilterDataPtr = Nil then begin
Result := ERROR_INVALID_PARAMETER;
Exit;
end;
Result := 0;
{ processing the contents of SourceBuffer left during one of previous calls }
ProcessSourceBuffer(FilterDataPtr^,
ZeroBasedArrayOfCharPtr(@Buffer)^,
BytesRead,
BufferSize-1);
if BytesRead >= BufferSize then { we do not have to read more source data, so: } Exit;
{ here we are when the SourceBuffer is empty but still we are asked for more data }
repeat
Result := StreamInfo.ReadFileFunc(@StreamInfo,
FilterDataPtr^.SourceBuffer,
MaxSourceBufferLength,
SourceBytesRead);
if Result <> 0 then { error reading data, so: } Exit;
with FilterDataPtr^ do begin
FilledFrom := 0;
FilledTill := SourceBytesRead - 1
end;
ProcessSourceBuffer(FilterDataPtr^,
ZeroBasedArrayOfCharPtr(@Buffer)^,
BytesRead,
BufferSize-1)
until (SourceBytesRead = 0) or (BytesRead >= BufferSize);
end; { ReadFilteredStream }
function CloseFilteredStream(var StreamInfo : StreamInfoType) : DWORD; export; stdcall;
begin { CloseFilteredStream }
Result := ERROR_INVALID_PARAMETER;
if StreamInfo.lpFilterData <> Nil then begin
try
FreeMem(StreamInfo.lpFilterData, SizeOf(FilterDataType));
StreamInfo.lpFilterData := Nil;
Result := 0
except
end
end
end; { CloseFilteredStream }
Exports GetFilterExtensions index 1,
CreateFilteredStream index 2,
ReadFilteredStream index 3,
CloseFilteredStream index 4;
Begin
End.
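Both sample filters return their supported extensions from GetFilterExtensions as a double-null-terminated list ("RTF" followed by two zero bytes in the C sample, 'HTM'#0 'HTML'#0#0 in the Delphi sample). How SSScanner itself reads this list is not described here; the sketch below only shows one plausible way for a caller to walk such a list. CountExtensions and HasExtension are hypothetical helper names, and the comparison is assumed to be exact and case-sensitive.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts the entries of a double-null-terminated string list.
   Each entry ends with '\0'; an empty entry marks the end of the list. */
static size_t CountExtensions(const char *list) {
    size_t count = 0;
    assert(list != NULL);
    while (*list != '\0') {
        count++;
        list += strlen(list) + 1; /* step over the entry and its terminator */
    }
    return count;
}

/* Returns 1 if ext occurs in the list, 0 otherwise. */
static int HasExtension(const char *list, const char *ext) {
    assert(list != NULL && ext != NULL);
    for (; *list != '\0'; list += strlen(list) + 1)
        if (strcmp(list, ext) == 0)
            return 1;
    return 0;
}
```

For the list "HTM\0HTML\0" (the compiler adds the second terminating zero), CountExtensions reports two entries and HasExtension finds "HTML" but not "RTF".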
Copyright © 1997-2008 Kryloff Technologies, Inc. All Rights Reserved