DITECT Spelling Check
§ 1 In general
DITECT is a subroutine system designed to be integrated into a word-processing or
typesetting program to check written text for spelling mistakes.
DITECT helps the user to correct misspelled words quickly in three ways:
a) Finding the error.
DITECT finds misspelled expressions in a fraction of a second, much faster than any
human being could, especially in long text files.
b) Recognizing the error type.
Once DITECT has marked a spelling error, the user still needs some time to find out
what is wrong, especially with long words or with expressions that look correct at
first glance.
DITECT helps to recognize the type of error in several ways:
- Direct pointer to the error position,
e.g.: "wrong expresion".
- List of proposal words.
- Different error markings depending on the type of error:

  Error type                                               | Proposal list
  ---------------------------------------------------------|--------------
  1) General spelling error                                | yes
  2) Incorrect small initial letter at start of sentence   | no
     Incorrect small initial letter within sentence        | no
  3) Incorrect capital initial letter                      | no
  4) Double word                                           | no
  5) Space character missing or doubled before word        | no
  6) Unwanted spelling *)                                  | yes
  7) Automatically replaced expression *)                  | no
  *) defined by user
If the text system is able to mark these error types in different ways, e.g. in
different colours, the user knows the type and position of a spelling error at once.
Recognizing and storing (learning) words unknown to DITECT also becomes very
easy, as unknown words can only occur with the first error type (general spelling error).
c) Error correction.
The user can now correct the error quickly, because DITECT points directly to its position.
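The error types listed under b) can be modelled in a calling program roughly as follows. This is only a sketch: the member names and helper functions are illustrative assumptions and not part of the DITECT interface. Only the numbers 1-7 and the proposal-list column are taken from the table.

```python
from enum import IntEnum


class ErrorType(IntEnum):
    """Error types 1-7 from the table (member names are illustrative)."""
    GENERAL_SPELLING = 1
    SMALL_INITIAL = 2            # at start of sentence or within sentence
    CAPITAL_INITIAL = 3
    DOUBLE_WORD = 4
    SPACE_MISSING_OR_DOUBLE = 5
    UNWANTED_SPELLING = 6        # defined by user
    AUTO_REPLACED = 7            # defined by user


# According to the table, only types 1 and 6 come with a proposal list.
TYPES_WITH_PROPOSALS = frozenset(
    {ErrorType.GENERAL_SPELLING, ErrorType.UNWANTED_SPELLING}
)


def has_proposal_list(error_type: ErrorType) -> bool:
    """True if a proposal list is available for this error type."""
    return error_type in TYPES_WITH_PROPOSALS


def may_be_learned(error_type: ErrorType) -> bool:
    """Unknown words can only show up as general spelling errors."""
    return error_type is ErrorType.GENERAL_SPELLING
```

A text system could use such a mapping to choose a marking colour per error type and to decide which fields of the error display to open.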
1. Operating display
The following display example is only a suggestion. It demonstrates how DITECT can
help the user to recognize the type and position of an error before correcting it.
Several languages have very long words, so the user may need considerable time to
find the position and type of an error in a marked word.
In addition, errors of types 2-6 are only marked at the start of the word, and with
error types 4-6 the marked word itself appears to be correct. This is why a
thorough error description is very important: otherwise the user may not see
the error and may store the word in the exception dictionary for "learning" instead
of correcting it.
Display description
The text displayed in the text window is passed to DITECT by the calling system.
If it contains an error or an unknown word, DITECT returns an index to the error
position together with the type of error.
The calling system can then set the cursor directly to the error position.
If the system does not allow direct correction within the text, the erroneous
word is displayed with some context in the error window (1) and the cursor (2) is
set to the error position.
Field (3) describes the type of error, and field (4) displays some proposals
(5-7) for easier correction.
More proposals may be found by scrolling down.
As it makes no sense to display a proposal list for error types 4-6
or to store these words for learning, the proposal list field (4-7) and the
unknown expression field (8-11) should be closed in these cases.
Now it is easy for the user to decide what has to be done:
If the word is erroneous, the user can correct it directly or click on one of
the proposals (5-7) to replace the marked word with it.
If the word is unknown, after a click on one of the fields (9-11), DITECT can:
a) learn it permanently (see 1.2: important expressions) or
b) learn it temporarily (see 1.2: unimportant expressions) or
c) ignore it (the word will be marked again next time).
| DITECT display | Text-window
|--------------------------------------------------------------| -----------
| Erstaunt stellten schwedische Forscher von der Universität |
| |
| Stockholm fest, dass beim Kompostieren von Gartenabfällen |
| |
| der Dioxngehalt auf das Dreifache Der normalen Umweltbelas- |
| * * |
| tung ansteigt. Die Giftmenge ist nicht akut gefährlich, aber |
| |
| dss die Horrorchemikalir durch biologische Prozesse akti- |
| * * |
| viert wird, ist neu. |
| |
| |
| | Error-window
| | ------------
|--------------------------------------------------------------| error-type:
| der Dioxngehalt auf das Dreifache Der normalen Umweltbela... | 1
|---------*----------------------------------------------------| 2
| General spelling error | 3
|------------------------------------|-------------------------|
| Proposal list | unknown expression | 4 | 8
|------------------------------------|-------------------------| |
| Dioxingehalt | x learn permanently | 5 | 9
| dioxinhaltig | x learn temporarily | 6 | 10
| dioxinbelastet | x ignore it | 7 | 11
|------------------------------------|-------------------------|
2. Dictionaries
DITECT uses a strongly compressed binary file (compression ratio 1:4) as its
base dictionary, which cannot be changed or updated by the user.
Based on this dictionary and on special algorithms for handling word endings
and compounds, DITECT is able to recognize, e.g. for the German language, far more
than 4 million words.
In addition, we constantly extend these base dictionaries whenever
new words are found.
The user may store new words unknown to DITECT in the permanent exception file
at any time.
Parts of the text that DITECT finds neither in the dictionary nor in the exception
file are marked as errors. The user decides whether these words are really incorrect.
If such a word is correct, the user may store it immediately so that it is known to
DITECT from then on.
Before storing, the user has to decide between unimportant and important words.
Unimportant words, such as foreign names, are in most cases only used for a short
time and seldom occur later. Such words are stored short-term so that
DITECT will not mark them as erroneous at every occurrence.
The user may decide whether or not to erase them at the end of the job.
Important words are stored permanently in the exception file.
Such words are known to DITECT just like the words in the base dictionary.
Abbreviation dots have to be stored as well: Prof. Str.
Single letters are ignored by DITECT and therefore need not be stored:
not N.Mex. but only Mex.
Abbreviation dots are end-of-word characters, so abbreviated combined words have
to be stored as their separate word parts:
not: comb.-words but: comb. and words
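The storage rules for abbreviations can be illustrated with a small helper that splits an expression into the parts that would have to be stored. This is a sketch under assumptions: the function name is hypothetical, and DITECT's own tokenization may differ in detail.

```python
import re


def storable_parts(expression: str) -> list:
    """Split an abbreviated expression into the parts to be stored.

    Rules taken from the text:
    - an abbreviation dot ends a word and is stored with it,
    - single letters are ignored and are not stored.
    """
    # A part is a run of letters, optionally followed by one dot.
    parts = re.findall(r"[^\W\d_]+\.?", expression)
    # Drop single letters such as "N." in "N.Mex."
    return [part for part in parts if len(part.rstrip(".")) > 1]
```

With this helper, `storable_parts("comb.-words")` yields the two parts `comb.` and `words`, and `storable_parts("N.Mex.")` yields only `Mex.`.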
3. Checking of capital/small initial letters
The typesetting system may pass a single word, a sentence or the entire text to
DITECT for spell checking.
If the text area contains at least one blank, DITECT assumes that it contains at
least one sentence.
In this case DITECT uses special criteria to find further sentences, so that it
can check not only the spelling and capitalization of words but also the
initial letter at the start of each sentence.
Problem cases that do not match these criteria are not recognized by DITECT and
may therefore be marked as incorrect capitalization.
If the user wishes, words of up to four capital letters are not checked, e.g.
GB, DM, USA, XYTV and so on, as these are special expressions such as company names
in which all letter combinations are possible.
The capitalization of nominalized verbs cannot be recognized correctly
in all cases!
§ 2 Treatment of hyphens
If there is a hyphen (-) at the end of a line (|), there are 3 possibilities:
1. The second part of the word starts with a small initial letter:
   It is a hyphen that splits the word at the end of the line.
   Both the hyphen and the end of line are ignored:  Zeilen-|ende ==> Zeilenende
2. The second part of the word starts with a capital initial letter:
   It is a combined-word hyphen (see 4).
   Only the end-of-line character is ignored:        Jo-|Ann ==> Jo-Ann
3. The hyphen character (-) or dash (/) is defined as hex. 002D
   in code file "DTCOnn" (meaning as under 2.).
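Cases 1 and 2 above can be sketched as a small function. The function name is hypothetical, and the sketch does not cover case 3 (characters redefined in the code file).

```python
def join_line_break(line_end: str, next_line_start: str) -> str:
    """Join the last word fragment of a line with the first one of the next line."""
    if not line_end.endswith("-") or not next_line_start:
        # No hyphen at the end of the line: these are two separate words.
        return (line_end + " " + next_line_start).strip()
    if next_line_start[0].islower():
        # Case 1: split hyphen, drop both the hyphen and the end of line.
        return line_end[:-1] + next_line_start
    # Case 2: combined-word hyphen, drop only the end of line.
    return line_end + next_line_start
```

For the examples above, `join_line_break("Zeilen-", "ende")` gives `Zeilenende` and `join_line_break("Jo-", "Ann")` gives `Jo-Ann`.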
§ 3 Word combinations not stored in dictionary
In many languages there are word combinations such as the following.
In many cases DITECT is able to recognize such expressions correctly even when
they are not stored in the dictionary:
1. Combined expressions
   Gustav-Peter                 If the expression is not found as a whole, DITECT
   AEG-Mannschaft               searches for the second expression starting after
                                the hyphen when switch "mexsw" = 1 or 2:
                                Gustav, Peter, AEG, Mannschaft.
   Brokat- und Seidenstoffe
   Brokat-/Seidenstoffe
   Lesungs- und Messungs-Rat    The combination s is O.K. in special cases,
                                even when it is not a normal ending.
2. Compound words
   Petermann                    Compound words not stored are found by their
   Stadtthemen                  single word parts when switch "mexsw" = 2 or 6:
                                Peter, Mann, Stadt, Themen.
3. Rules for compound word recognition
Symbols Explanation
aaa bbb words with small initial letters, e.g. verbs
Ccc Ddd words with capital initial letters, e.g. nouns
Compound word valid invalid
aaabbb x
aaaCcc x
CccDdd x
Cccddd x)
x) Minimum length of word compounds (default=4) may be redefined by user.
Following these rules, DITECT can recognize missing spaces between words with high
accuracy.
4. Suffixed words not stored in dictionary
DITECT is very often able to recognize words with suffixes correctly even when
the suffixed forms are not stored in the dictionary. If, for example, the German
word lustig is stored in the dictionary without any of its possible endings,
DITECT is also able to recognize:
lustig- e em en er ere erem eren erer eres es ste stem sten ster stes
With the abilities described under § 3, DITECT is able to recognize
many more words than are stored in the dictionary, as in many languages words are
composed of word combinations and suffixes.
In addition, new words are created every day, mostly by combining existing words.
A spell checker that is based only on the words stored in its dictionary is
unable to recognize these new creations.
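The suffix example with lustig can be illustrated by a simple lookup that strips a known ending and checks the remaining stem. This is a minimal sketch and not DITECT's actual algorithm: the function name is hypothetical, and only the ending list is taken from the example above.

```python
# Endings from the "lustig" example in the text.
ADJECTIVE_ENDINGS = (
    "e", "em", "en", "er", "ere", "erem", "eren", "erer", "eres",
    "es", "ste", "stem", "sten", "ster", "stes",
)


def is_known(word: str, base_words: set, endings=ADJECTIVE_ENDINGS) -> bool:
    """True if the word itself or its stem plus a known ending is stored."""
    if word in base_words:
        return True
    for ending in endings:
        # Strip the ending and look up what remains.
        if word.endswith(ending) and word[: -len(ending)] in base_words:
            return True
    return False
```

With only `lustig` stored, forms such as `lustigerem` or `lustigste` are accepted, while a form with an unknown ending is still reported.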
5. Email and Internet addresses
Email and Web addresses are combinations of special expressions joined by
signs such as . - _ /
For example, spell checking the Web address http://www.ub-dieck.com/dtgeneng.htm would
cause 7 error stops, at: "http", "www", "ub", "dieck", "com", "dtgeneng" and "htm"!
As it makes no sense to spell check such expressions, DITECT is able to
ignore them when they or specific parts of them are stored in file dtexpr.skp.
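How such a skip list might work can be sketched as follows. The format of dtexpr.skp is not described here, so the sketch simply assumes a list of skip expressions that are matched as parts of a blank-separated token. The marker values and the function name are hypothetical.

```python
# Hypothetical skip expressions; the real content of dtexpr.skp is not known here.
SKIP_EXPRESSIONS = ("http://", "www.", "@")


def tokens_to_check(text: str, skip_expressions=SKIP_EXPRESSIONS) -> list:
    """Return the blank-separated tokens that should still be spell checked."""
    return [
        token
        for token in text.split()
        if not any(expr in token for expr in skip_expressions)
    ]
```

A token such as `http://www.ub-dieck.com/dtgeneng.htm` is then skipped as a whole instead of causing seven separate error stops.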
§ 4 Proposal word list in case of error
When DITECT marks a word as erroneous, it extracts up to 20 of
the most similar words from the dictionary.
Each of these words is preceded by a number indicating the percentage of similarity
and may be used as a proposal for correction, e.g.:
Desperat (= incorrect spelling !)
% proposed words
98 desperat
77 Desperado
77 Desiderat
A special algorithm is used to find proposal words with high accuracy even when
letters in the incorrectly written word are missing, superfluous or transposed.
Zustimug (= incorrect spelling !)
% proposed words
66 Zustimmung
62 zustimme
62 zustimmt
56 Zustimmens
56 zustimmen
56 zustimmst
56 zustimmte
50 zustimmend
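The idea of a ranked proposal list can be illustrated with Python's standard `difflib` module. This is explicitly not DITECT's special algorithm, so the percentages it produces will differ from the lists above. The function name and the tie-breaking rule are assumptions for illustration only.

```python
from difflib import SequenceMatcher


def proposals(word: str, dictionary, limit: int = 20) -> list:
    """Return up to `limit` (percent, candidate) pairs, most similar first."""
    scored = [
        (round(100 * SequenceMatcher(None, word, candidate).ratio()), candidate)
        for candidate in dictionary
    ]
    # Highest similarity first; equal scores are ordered alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:limit]
```

Calling `proposals("Zustimug", ["Haus", "zustimmen", "Zustimmung"])` ranks `Zustimmung` first, because it shares the longest matching letter sequences with the misspelled word.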
Unrecognized errors or unwanted words
When large dictionaries are created, some incorrect entries are always possible.
If the user detects a misspelled or unwanted spelling, he may store it with the
ending / * or # in the permanent exception file, and DITECT will then mark it as
incorrect. The ending * or # may also serve as an abbreviation (truncation) sign.
With the ending # the abbreviation may be followed by at most 2 more letters,
while with the ending * the number of following letters is unlimited, e.g.
crude# causes crudely to be marked as incorrect but not crudeness, while
crude* causes all words starting with crude to be marked as incorrect.
Such a refused expression (e.g. Photo) may be extended by a proposal
(e.g. Foto) like this: Photo/Foto/*
With the ending # or * the proposal is expanded (if necessary) and displayed in
the proposal list.
For the text word Photoatelier, for example, the proposal Fotoatelier is displayed.
A refused* or refused/proposed/* expression may contain blanks as well, e.g.:
am Besten/am besten/* (if switch "mexsw" +8).
If the exception expression ends not with * but with . (dot), the calling program
replaces the "refusal" by the "proposal" and error no. 7 is returned, e.g.:
mdb/Mitglied der Gemeinde/.
See description: Exception Dictionary
Some misspelling examples in a German exception file:
Photo/Foto/*
am Besten/am besten/*
Vaterliebe
Vaterl*
faß/fass/*
faßt/fasst/*
paralell*
Attention:
The example above means that all words starting with "Vaterl" are refused,
except for "Vaterliebe", which is accepted!
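The entry formats described in this section can be sketched as a small parser and matcher. This is an illustration under assumptions, not DITECT's implementation: all names are hypothetical, entries with a dash (/) inside a word are not handled, and the exact-match rule for entries ending with a dot is assumed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    refused: str             # expression as stored, without its ending
    proposal: Optional[str]  # replacement text, if any
    ending: str              # "*", "#", "." or "" (plain accepted word)


def parse_entry(line: str) -> Entry:
    """Parse one line such as 'Photo/Foto/*', 'crude#' or 'Vaterliebe'."""
    line = line.strip()
    if line.count("/") >= 2:
        refused, proposal, ending = line.split("/", 2)
        return Entry(refused, proposal, ending)
    if line.endswith(("*", "#")):
        return Entry(line[:-1], None, line[-1])
    return Entry(line, None, "")


def is_refused(word: str, entry: Entry) -> bool:
    if entry.ending == "" or not word.startswith(entry.refused):
        return False
    extra = len(word) - len(entry.refused)
    if entry.ending == "*":
        return True          # unlimited number of following letters
    if entry.ending == "#":
        return extra <= 2    # at most 2 more letters
    return extra == 0        # "." entries: exact match only (assumption)


def check_word(word: str, entries: List[Entry]) -> str:
    # Accepted words win over refused prefixes (Vaterliebe vs. Vaterl*).
    if any(e.ending == "" and e.refused == word for e in entries):
        return "accepted"
    for e in entries:
        if is_refused(word, e):
            if e.proposal is None:
                return "refused"
            # Expand the proposal with the rest of the text word.
            return "refused, proposal: " + e.proposal + word[len(e.refused):]
    return "unknown"
```

With the entries `Photo/Foto/*`, `Vaterliebe`, `Vaterl*` and `crude#`, the word Photoatelier is refused with the proposal Fotoatelier, Vaterliebe is accepted, Vaterland and crudely are refused, and crudeness is not affected.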
§ 5 Exception files
When DITECT marks a word as incorrect, the word is either
1. incorrect: the user has to correct it,
or
2. correct but unimportant,
   e.g. a foreign name:
   2.1 the user ignores it, and DITECT will
       mark it again at every occurrence,
   2.2 or the user stores it "short-term",
or
3. correct and important: the user stores it "medium-term"
   (and automatically "short-term").
The name of the "short-term" file(s) is DTnnTMP.* (nn = language no.).
Every word unknown to DITECT is automatically searched for in this file and, if it
is not found there, stored in it. A special fast-access method is used for storage.
This file cannot be edited, as it is in binary format.
Using the software switch 'ftmp', the user may decide when the typesetting system
erases this file, e.g. at the end of the job or after permanent storage of the
"medium-term" file DTnnEXC.*
As the "short-term" file contains many unimportant words, it should not be
kept longer than necessary and should not grow too much, as otherwise
program performance may decrease.
The name of the "medium-term" file(s) is DTnnEXC.*
Words are not searched for in this file but stored sequentially, no matter how often
the typesetting system is restarted, until the user stores them permanently with:
DTEXA nn
After this, files DTnnEXC.* and DTnnTMP.* are automatically erased.
Before using DTEXA nn, the user may edit the file(s) DTnnEXC.* for final corrections.
Network files
If not defined by the user, DITECT automatically assigns an unused number (1 - 999)
to every workstation for its short-term or medium-term files. The program call
DTALLMED nn
(nn = language no.) copies all medium-term files into file DTnnEXC and releases
the numbers for later use.
§ 6 Permanent exception-file
The permanent exception file DTEXnn.TXT may be updated with the following
batch program calls:
DTEXD nn ( Display words, build catalogue )
or
DTEXA nn ( Add words, build catalogue )
Calling DTEXD displays the entire file in the editor (PE2) so that the user can
modify it.
Please note that capital and small initial letters have to be written correctly.
Abbreviations are allowed with ending colon.
Apostrophe ('), combined-word hyphen (-) and dash (/) within a word are
allowed as well. After the return from the editor, the file is automatically checked
for incorrect characters and, if it is O.K., sorted.
Otherwise an error text enclosed in apostrophes is added at the end of every incorrect
word and the file editor is started again for correction
(see: Exception Dictionary).
When DTEXA is called, file DTnnEXC.* is automatically added to file DTEXnn.TXT.
From then on it works like DTEXD.
After this, files DTnnEXC.* and DTnnTMP.* are automatically erased.
(nn = two-digit (!) language no.)