Bryan McEleney
Proofreading usually means reading a text slowly to spot grammar and spelling mistakes. It would be nice to have intelligent software that identifies those mistakes, or at least highlights phrases where a mistake is likely. Some word processors come with a grammar and spelling checker, but these are rule- and dictionary-based, and the results often aren’t very good.
Could a statistical, language-model-based proofreader do better? Since the approach is statistical rather than rule-based, the results will certainly be different, and should be better informed by the world knowledge and the exceptions to the rules that shape natural language.
Surprisingly, it’s still an open question, but perhaps that’s because most scientists only bother answering a question when the answer is yes!
Five years ago, as an exercise in recreational programming, I developed a “category-based language model” for this task in C++. Category-based means that the model works with categories that are similar to grammatical classes but finer-grained. A category-based model requires less training data than a word-based model, since the data for a group of words is pooled into one category. The categories are determined automatically by grouping words whose contexts in the training text are similar.
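To make "grouping words whose contexts are similar" concrete, here is a minimal C++17 sketch. It is not the original program: the left/right-neighbour features, the cosine similarity measure, the greedy grouping, and all function names (`tokenize`, `collectContexts`, `cosine`, `assignCategories`) are assumptions chosen for illustration, and a real system would likely use a more careful clustering method.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Sparse context vector: counts of "L:<word>" (left neighbour) and
// "R:<word>" (right neighbour) features seen around a target word.
using ContextVector = std::map<std::string, double>;

// Split on whitespace only; a real tokenizer would handle punctuation and case.
std::vector<std::string> tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> tokens;
    for (std::string w; in >> w;) tokens.push_back(w);
    return tokens;
}

// Count the immediate left and right neighbours of every word.
std::map<std::string, ContextVector> collectContexts(
        const std::vector<std::string>& tokens) {
    std::map<std::string, ContextVector> contexts;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        ContextVector& v = contexts[tokens[i]];
        if (i > 0) v["L:" + tokens[i - 1]] += 1.0;
        if (i + 1 < tokens.size()) v["R:" + tokens[i + 1]] += 1.0;
    }
    return contexts;
}

// Cosine similarity of two sparse vectors; 0 if either one is empty.
double cosine(const ContextVector& a, const ContextVector& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (const auto& [feature, x] : a) {
        na += x * x;
        auto it = b.find(feature);
        if (it != b.end()) dot += x * it->second;
    }
    for (const auto& [feature, y] : b) nb += y * y;
    if (na == 0.0 || nb == 0.0) return 0.0;
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// Greedy grouping (an illustrative simplification): a word joins the first
// category whose founding word has a similar enough context vector,
// otherwise it founds a new category. Returns word -> category id.
std::map<std::string, int> assignCategories(
        const std::map<std::string, ContextVector>& contexts, double threshold) {
    assert(threshold >= 0.0 && threshold <= 1.0);
    std::vector<std::string> founders;
    std::map<std::string, int> category;
    for (const auto& [word, vec] : contexts) {
        int chosen = -1;
        for (std::size_t c = 0; c < founders.size(); ++c) {
            if (cosine(vec, contexts.at(founders[c])) >= threshold) {
                chosen = static_cast<int>(c);
                break;
            }
        }
        if (chosen < 0) {
            founders.push_back(word);
            chosen = static_cast<int>(founders.size()) - 1;
        }
        category[word] = chosen;
    }
    return category;
}
```

On a toy corpus such as "the cat sat . the dog sat . the cat ran . the dog ran .", "cat" and "dog" have identical neighbour counts and land in one category, as do "sat" and "ran". Counts collected for either word of a pair can then be pooled, which is where the saving in training data comes from.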
Language models are commonly applied to another task in natural language processing: machine translation. There they are used to determine the most probable word order of a translated phrase, in conjunction with a statistical dictionary, which determines the most probable word-for-word translation of the individual words.
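The word-order half of that idea can be sketched with a plain word bigram model. This is only an illustration under several assumptions: the `BigramModel` type, add-one smoothing, and the brute-force search over permutations in `bestOrder` are my choices for the example, not how a production translation system works, and the statistical dictionary is left out entirely.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Sentence = std::vector<std::string>;

const std::string kStart = "<s>";   // sentence-start marker
const std::string kEnd = "</s>";    // sentence-end marker

struct BigramModel {
    std::map<std::pair<std::string, std::string>, int> bigramCount;
    std::map<std::string, int> contextCount;  // times a word starts a bigram
    std::set<std::string> vocab;              // all symbols, markers included

    void train(const std::vector<Sentence>& corpus) {
        vocab.insert(kStart);
        vocab.insert(kEnd);
        for (const Sentence& s : corpus) {
            std::string prev = kStart;
            for (const std::string& w : s) {
                ++bigramCount[std::make_pair(prev, w)];
                ++contextCount[prev];
                vocab.insert(w);
                prev = w;
            }
            ++bigramCount[std::make_pair(prev, kEnd)];
            ++contextCount[prev];
        }
    }

    // Add-one smoothed log P(next | prev), so unseen pairs get a small
    // non-zero probability instead of ruling a sentence out entirely.
    double logProb(const std::string& prev, const std::string& next) const {
        auto b = bigramCount.find(std::make_pair(prev, next));
        auto c = contextCount.find(prev);
        double num = (b == bigramCount.end() ? 0 : b->second) + 1.0;
        double den = (c == contextCount.end() ? 0 : c->second)
                     + static_cast<double>(vocab.size());
        return std::log(num / den);
    }

    // Log probability of a whole sentence, including start and end markers.
    double score(const Sentence& s) const {
        double total = 0.0;
        std::string prev = kStart;
        for (const std::string& w : s) {
            total += logProb(prev, w);
            prev = w;
        }
        return total + logProb(prev, kEnd);
    }
};

// Try every ordering of the given words and keep the most probable one.
// Factorial cost, so this is only sensible for very short phrases.
Sentence bestOrder(const BigramModel& model, Sentence words) {
    assert(!words.empty() && words.size() <= 8);
    std::sort(words.begin(), words.end());
    Sentence best = words;
    double bestScore = model.score(words);
    while (std::next_permutation(words.begin(), words.end())) {
        double s = model.score(words);
        if (s > bestScore) {
            bestScore = s;
            best = words;
        }
    }
    return best;
}
```

Trained on the three sentences "the cat sat", "the dog sat" and "the cat ran", `bestOrder` given the words "sat", "cat", "the" in any order picks "the cat sat", because that ordering is built from the bigrams seen most often.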
I ran out of time to evaluate that software. I suspect that, given plenty of training material and enough computer hardware to build the language model, it would have been quite useful. At the time I used an IBM ThinkPad 240 (300 MHz, 320 MB of RAM) with a corpus of a dozen or so out-of-copyright novels. Even so, the software was able to notice obvious mistakes, such as two verbs in a row, without much difficulty.
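A rough sketch of how a category-level model can catch something like two verbs in a row follows. It is not the original code: the `CategoryBigramChecker` class, the hand-written lexicon with labels such as `DET` and `VERB` (the original learned its categories automatically), the unsmoothed relative-frequency estimate, and the fixed threshold are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Sentence = std::vector<std::string>;

class CategoryBigramChecker {
public:
    // lexicon maps word -> category label. Here it is supplied by hand
    // (hypothetical); unknown words fall into the catch-all "UNK".
    explicit CategoryBigramChecker(std::map<std::string, std::string> lexicon)
        : lexicon_(std::move(lexicon)) {}

    // Count category-to-category transitions in the training sentences.
    void train(const std::vector<Sentence>& corpus) {
        for (const Sentence& s : corpus) {
            for (std::size_t i = 0; i + 1 < s.size(); ++i) {
                const std::string a = categoryOf(s[i]);
                const std::string b = categoryOf(s[i + 1]);
                ++pairCount_[std::make_pair(a, b)];
                ++firstCount_[a];
            }
        }
    }

    // Relative-frequency estimate of P(next category | previous category);
    // 0 for transitions never seen in training (no smoothing in this sketch).
    double transitionProb(const std::string& prev, const std::string& next) const {
        auto total = firstCount_.find(prev);
        if (total == firstCount_.end()) return 0.0;
        auto pair = pairCount_.find(std::make_pair(prev, next));
        if (pair == pairCount_.end()) return 0.0;
        return static_cast<double>(pair->second) / total->second;
    }

    // Indices i where the transition from words[i] to words[i + 1] is less
    // probable than the threshold, i.e. places worth a second look.
    std::vector<std::size_t> suspicious(const Sentence& words, double threshold) const {
        assert(threshold >= 0.0 && threshold <= 1.0);
        std::vector<std::size_t> flagged;
        for (std::size_t i = 0; i + 1 < words.size(); ++i) {
            double p = transitionProb(categoryOf(words[i]), categoryOf(words[i + 1]));
            if (p < threshold) flagged.push_back(i);
        }
        return flagged;
    }

private:
    std::string categoryOf(const std::string& word) const {
        auto it = lexicon_.find(word);
        return it == lexicon_.end() ? "UNK" : it->second;
    }

    std::map<std::string, std::string> lexicon_;
    std::map<std::pair<std::string, std::string>, int> pairCount_;
    std::map<std::string, int> firstCount_;
};
```

Trained on "the cat sat", "a dog ran quickly" and "the dog slept", the checker flags the pair "sat ran" in "the cat sat ran", since a verb was never followed by another verb in training. It does not flag "a cat slept", even though neither "a cat" nor "cat slept" occurs in the corpus as words, because the counts are pooled per category.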