unicode - Using Markov models to convert all-caps to mixed-case and related problems
I've been thinking about using Markov techniques to restore missing information to natural language text.
- Restore all-caps text to mixed-case.
- Restore accents / diacritics to languages which should have them but have been converted to plain ASCII.
- Convert rough phonetic transcriptions to native alphabets.
That seems to be in order from least difficult to most difficult. Basically, the problem is resolving ambiguities based on context.
I can use Wiktionary as a dictionary and Wikipedia as a corpus, using n-grams and hidden Markov models to resolve the ambiguities.
Am I on the right track? Are there already services, libraries, or tools for this sort of thing?
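To make the idea concrete, here is a minimal sketch of context-based disambiguation with a bigram model and Viterbi decoding, applied to case restoration. Everything in it is an assumption for illustration: the three-sentence `CORPUS` stands in for a real corpus such as Wikipedia, the candidate forms are collected from that corpus rather than from Wiktionary, and the names `train`, `viterbi`, and `restore_case` are hypothetical. A real system would need far more data and better smoothing than add-one.

```python
import math
from collections import Counter, defaultdict

# Toy stand-in for a real corpus (assumption for illustration only).
CORPUS = [
    "George lost his SIM card in the bush".split(),
    "the SIM card is in the phone".split(),
    "George Bush lost the election".split(),
]

def train(sentences):
    """Count unigrams and bigrams, and record every cased form of each word."""
    unigrams, bigrams = Counter(), Counter()
    forms = defaultdict(set)
    for tokens in sentences:
        padded = ["<s>"] + tokens
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
        for tok in tokens:
            forms[tok.lower()].add(tok)
    return unigrams, bigrams, forms

def log_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed log P(word | prev)."""
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + len(unigrams)))

def viterbi(lattice, unigrams, bigrams):
    """Return the most probable path through a list of candidate lists."""
    # best[form] = (score of the best path ending in form, that path)
    best = {"<s>": (0.0, [])}
    for options in lattice:
        new_best = {}
        for form in options:
            score, path = max(
                (s + log_prob(prev, form, unigrams, bigrams), p)
                for prev, (s, p) in best.items()
            )
            new_best[form] = (score, path + [form])
        best = new_best
    return max(best.values())[1]

def restore_case(text, model):
    """Restore case; unseen words fall back to their lowercase form."""
    unigrams, bigrams, forms = model
    lattice = [sorted(forms.get(t.lower(), {t.lower()})) for t in text.split()]
    return " ".join(viterbi(lattice, unigrams, bigrams))

model = train(CORPUS)
print(restore_case("GEORGE LOST HIS SIM CARD IN THE BUSH", model))
print(restore_case("GEORGE BUSH LOST THE ELECTION", model))
```

On this toy data the only ambiguous word is "bush" / "Bush", and the preceding word decides it: "the" favours the common noun and "George" favours the surname. The same lattice-plus-decoder shape would apply to diacritics by swapping in accented candidates for each stripped word.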
Examples
- GEORGE LOST HIS SIM CARD IN THE BUSH ⇨ George lost his SIM card in the bush
- tantot il rit a gorge deployee ⇨ tantôt il rit à gorge déployée
I think you can use Markov models (HMMs) for all three tasks, but also take a look at more modern models such as conditional random fields (CRFs). Also, here's a boost for your google-fu:
- Restore mixed case to text in all caps
This is called truecasing.
- Restore accents / diacritics to languages which should have them but have been converted to plain ASCII
I suspect Markov models are going to have a hard time on this. OTOH, labelled training data is free, since you can just take a bunch of accented text in the target language and strip the accents. See also the next answer.
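As a rough illustration of why the training data is free, the sketch below strips accents with Python's standard `unicodedata` module and pairs each stripped sentence with its original. The helper names (`strip_accents`, `make_training_pairs`, `build_table`, `restore_baseline`) are hypothetical, and the most-frequent-form lookup is only a context-free baseline, not the model suggested above. One known limitation, stated as an assumption about the input: letters without a canonical decomposition, such as "ø" or "ß", pass through unchanged.

```python
import unicodedata
from collections import Counter, defaultdict

def strip_accents(text):
    """Decompose to NFD, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def make_training_pairs(sentences):
    """Return (stripped input, accented target) pairs from accented text."""
    return [(strip_accents(s), s) for s in sentences]

def build_table(sentences):
    """Map each stripped word to its most frequent accented form."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word in sentence.split():
            counts[strip_accents(word)][word] += 1
    return {plain: c.most_common(1)[0][0] for plain, c in counts.items()}

def restore_baseline(text, table):
    """Context-free baseline: look each word up, keep it if unseen."""
    return " ".join(table.get(word, word) for word in text.split())

accented = ["tantôt il rit à gorge déployée"]
pairs = make_training_pairs(accented)
table = build_table(accented)
print(pairs[0][0])
print(restore_baseline(pairs[0][0], table))
```

A lookup like this cannot tell French "a" from "à", which is exactly the kind of ambiguity that needs context from an n-gram model, an HMM, or a CRF.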
- Convert rough phonetic transcriptions to native alphabets
This seems strongly related to machine transliteration, which has been tried using pair HMMs (from bioinformatics/genome work).