
Truecasing, also called capitalization recovery,[1] capitalization correction,[2] or case restoration,[3] is the problem in natural language processing (NLP) of determining the proper capitalization of words where such information is unavailable. The problem commonly arises from the standard practice (in English and many other languages) of capitalizing the first word of a sentence, which hides whether that word would be capitalized elsewhere. It also arises in badly cased or uncased text (for example, all-lowercase or all-uppercase text messages).
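
The task can be illustrated with a minimal sketch of one simple baseline: replace each word with the casing it most often has in a reference corpus. This is an illustrative assumption rather than a method taken from the cited sources, and the names `train_truecaser` and `truecase` as well as the toy corpus are hypothetical. Sentence-initial tokens are skipped during counting because their capitalization is uninformative.

```python
from collections import Counter, defaultdict


def train_truecaser(sentences):
    """Count the observed casings of each word, skipping sentence-initial tokens."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        # The first token is capitalized by convention, so it is ignored.
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    return counts


def truecase(text, counts):
    """Replace each token with its most frequent casing; keep unseen tokens as-is."""
    restored = []
    for token in text.split():
        forms = counts.get(token.lower())
        restored.append(forms.most_common(1)[0][0] if forms else token)
    return " ".join(restored)


# Toy corpus (hypothetical data, for illustration only).
corpus = [
    "We visited Paris in May",
    "She said Paris was lovely",
    "They may visit London soon",
    "He may call London later",
]
counts = train_truecaser(corpus)
print(truecase("she may visit paris", counts))  # she may visit Paris
```

The sketch does not restore the sentence-initial capital and ignores context, so a word such as "may" always receives its majority casing regardless of whether it denotes the month.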

Truecasing is unnecessary in languages whose scripts do not distinguish between uppercase and lowercase letters. This includes most languages not written in the Latin, Greek, Cyrillic, or Armenian alphabets, such as Korean, Japanese, Chinese, Thai, Hebrew, Arabic, Hindi, and Georgian.
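
Whether a given text has any case distinction at all can be checked programmatically before a truecaser is applied. The following is a minimal sketch, assuming Python's built-in Unicode case mappings are an adequate proxy; the function name `has_case_distinction` is hypothetical, and the result depends on the Unicode data bundled with the interpreter, so it may not match the classification above for every script.

```python
def has_case_distinction(text):
    """Return True if any character changes under upper- or lowercasing."""
    return any(ch.lower() != ch.upper() for ch in text)


# Scripts with case yield True; caseless scripts yield False.
for sample in ["hello", "русский", "한국어", "日本語", "עברית"]:
    print(sample, has_case_distinction(sample))
```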

Techniques

Applications

Truecasing aids other NLP tasks, such as named entity recognition (NER), automatic content extraction (ACE), and machine translation.[4] Correct capitalization makes proper nouns easier to detect, and proper nouns are the starting points of NER and ACE. Some translation systems use statistical machine learning techniques, which can use the information carried by capitalization to increase accuracy.

References

  1. ^ Brown, Eric W.; Coden, Anni R. (2002). "Capitalization Recovery for Text". Information Retrieval Techniques for Speech Applications. Lecture Notes in Computer Science. Vol. 2273. pp. 11–22. doi:10.1007/3-540-45637-6_2. ISBN 978-3-540-43156-5.
  2. ^ US patent 7,827,025 B2, Peter K. L. Mau & Dong Yu, "Efficient capitalization through user modeling", issued 2010-11-02, assigned to Microsoft Corporation 
  3. ^ US patent 8,972,855 B2, Zhu Liu; David Gibbon & Behzad Shahraray, "Method and apparatus for providing case restoration", issued 2015-03-03, assigned to AT&T Intellectual Property I, L.P. 
  4. ^ Lita, L. V.; Ittycheriah, A.; Roukos, S.; Kambhatla, N. (2003). "tRuEcasIng". Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics. Sapporo, Japan. pp. 152–159.