unicode - Using Markov models to convert all-caps to mixed-case and related problems

- January 15, 2014

I've been thinking about using Markov techniques to restore missing information to natural language text.

Restore all-caps text to mixed-case.
Restore accents / diacritics to languages which should have them but have been converted to plain ASCII.
Convert rough phonetic transcriptions back into native alphabets.

That seems to be in order from least difficult to most difficult. In each case the problem is resolving ambiguities based on context.

I can use Wiktionary as a dictionary and Wikipedia as a corpus, using n-grams and hidden Markov models to resolve the ambiguities.

Am I on the right track? Are there already services, libraries, or tools for this sort of thing?

Examples

GEORGE LOST HIS SIM CARD IN THE BUSH ⇨ George lost his SIM card in the bush
tantot il rit a gorge deployee ⇨ tantôt il rit à gorge déployée

I think you can use Markov models (HMMs) for all three tasks, but also take a look at more modern models such as conditional random fields (CRFs). Also, here's a boost for your google-fu:

Restore mixed case to text in all caps

This is called truecasing.
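To make the n-gram idea concrete for truecasing, here is a minimal sketch, not a production approach. It treats each cased word form as a hidden state, uses bigram counts over cased tokens as transition scores, and picks the best casing sequence with Viterbi decoding. The class name `Truecaser`, the toy `CORPUS` (standing in for something like Wikipedia), and the particular smoothing formula are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # sentence-start marker (illustrative choice)

# Toy training data; a real system would use a large mixed-case corpus.
CORPUS = [
    "George lost his SIM card in the bush",
    "The bush was dry",
    "He bought a new SIM card",
    "I saw George in the park",
    "His card was lost",
]


class Truecaser:
    """Hypothetical bigram truecaser: hidden states are cased word forms."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                    # weight of the unigram back-off
        self.variants = defaultdict(Counter)  # "sim" -> Counter({"SIM": 2})
        self.unigrams = Counter()             # counts of cased tokens
        self.bigrams = Counter()              # counts of (prev, cur) pairs
        self.contexts = Counter()             # how often a token is a left context
        self.total = 0

    def train(self, sentences):
        for sentence in sentences:
            tokens = [START] + sentence.split()
            for prev, cur in zip(tokens, tokens[1:]):
                self.variants[cur.lower()][cur] += 1
                self.unigrams[cur] += 1
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1
                self.total += 1

    def _log_prob(self, prev, cur):
        # Bigram estimate smoothed towards an add-one unigram estimate,
        # so unseen bigrams and unknown words never get probability zero.
        p_uni = (self.unigrams[cur] + 1) / (self.total + len(self.unigrams) + 1)
        num = self.bigrams[(prev, cur)] + self.alpha * p_uni
        den = self.contexts[prev] + self.alpha
        return math.log(num / den)

    def restore(self, text):
        # best maps each candidate state to (log score, best path ending there)
        best = {START: (0.0, [])}
        for word in text.lower().split():
            # Unknown words keep their lowercase form as the only candidate.
            candidates = list(self.variants.get(word, {word: 1}))
            new_best = {}
            for cand in candidates:
                score, path = max(
                    ((s + self._log_prob(prev, cand), p)
                     for prev, (s, p) in best.items()),
                    key=lambda item: item[0],
                )
                new_best[cand] = (score, path + [cand])
            best = new_best
        return " ".join(max(best.values(), key=lambda item: item[0])[1])


truecaser = Truecaser()
truecaser.train(CORPUS)
print(truecaser.restore("GEORGE LOST HIS SIM CARD IN THE BUSH"))
```

Because every cased form maps back to exactly one all-caps form, the emission step is deterministic here and only the transitions need scoring. A CRF would let the model use richer features (position in sentence, neighbouring words on both sides) instead of bigram counts alone.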

Restore accents / diacritics to languages which should have them but have been converted to plain ASCII

I suspect Markov models are going to have a hard time on this. OTOH, labelled training data is free, since you can just take a bunch of accented text in the target language and strip the accents. See also the next answer.
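The "free training data" point can be sketched in a few lines. The example below, under the assumption that accents are represented as Unicode combining marks after NFD decomposition, strips them to produce (plain, accented) pairs and then builds a context-free lookup table as a baseline. The function names and the two-sentence corpus are hypothetical. Letters such as ø or ß do not decompose this way and would need a separate mapping.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_accents(text):
    """Remove combining marks (accents) after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def build_restoration_table(accented_sentences):
    """Map each accent-free word to counts of the accented forms seen for it."""
    table = defaultdict(Counter)
    for sentence in accented_sentences:
        for word in sentence.split():
            table[strip_accents(word)][word] += 1
    return table


def restore_accents(text, table):
    """Baseline: replace each word with its most frequent accented form."""
    restored = []
    for word in text.split():
        candidates = table.get(word)
        restored.append(candidates.most_common(1)[0][0] if candidates else word)
    return " ".join(restored)


# Toy corpus standing in for a large body of accented text.
corpus = ["tantôt il rit à gorge déployée", "elle va à Paris"]
table = build_restoration_table(corpus)

plain = strip_accents(corpus[0])
print(plain)
print(restore_accents(plain, table))
```

This baseline ignores context, so it cannot tell French "a" from "à" except by frequency. That ambiguity is exactly what a sequence model trained on the generated pairs would be expected to resolve.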

Convert rough phonetic transcriptions to native alphabets

This seems strongly related to machine transliteration, which has been tried using pair HMMs (from bioinformatics/genome work).
