regex - Advanced text replacement (cloze deletion)
Well, I'd like to replace specific text based on other text. Yeah, that sounds funny, so here is the problem: how do I do replacements in tab-separated values? Essentially, I'd like to replace the matching vocabulary string found in a sentence with {...}.
The value before the tab (\t) is the vocab, and the value after the tab is the sentence. The value on the left of \t is the first column, and the one on the right is the second column.
TL;DR version (English version)
Essentially, I want to replace text in the second column based on the first column.
Example:
abcd \t 19475abcd_97jdhgbl
should turn into
abcd \t 19475{...}_97jdhgbl
Here abcd is the first column and 19475abcd_97jdhgbl is the second one.
If you don't get the context of the long version below, solving this abcd problem is fine for me. I think it's quite simple code, but given that it's been 4 years since I last coded in C and I've only just started learning Python, I can't do it.
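For the simple abcd case, a minimal sketch in Python might look like the following. It assumes the two columns are separated by a real tab character; the function name cloze_exact and the placeholder parameter are made up for illustration.

```python
def cloze_exact(line, placeholder="{...}"):
    """Replace every exact occurrence of column 1 inside column 2."""
    # Assumption: the columns are separated by a real tab character.
    vocab, sep, sentence = line.rstrip("\n").partition("\t")
    if not sep or not vocab:
        # No tab (or an empty first column): return the line unchanged.
        return line.rstrip("\n")
    return vocab + "\t" + sentence.replace(vocab, placeholder)


print(cloze_exact("abcd\t19475abcd_97jdhgbl"))
```

Note that str.replace swaps out every occurrence of the vocab in the sentence, not just the first one.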
Long version (Japanese-specific text):
1. Case 1 (pure kanji):
全部 \t それ、全部ください。
becomes
全部 \t それ、{...}ください。
2. Case 2 (pure kana):
ああ \t ああうるさい人は苦手です。
becomes
ああ \t {...}うるさい人は苦手です。
あいづち \t 彼の話に私はあいづちを打ったの。
becomes
あいづち \t 彼の話に私は{...}を打ったの。
Cases 1 and 2 need exact matches, especially for kana, because otherwise the code might replace other kana in the sentence. The code for case 3 has to be different (see next).
3. Case 3 (mixed kana and kanji):
This is the complex one. For this one, I'd like the script/solution to change only the matching strings, i.e., ignore what doesn't match and replace the matches that are found. It should take the longest possible match and replace accordingly.
上げる \t 彼は荷物をあみだなに上げた。
becomes
上げる \t 彼は荷物をあみだなに{...}た。
Note that here the first column has 上げる while the second column has 上げた, because the tense has changed (the first column ends in る while the second one ends in た).
So, ideally, the solution should take the longest string found in both columns, in this case 上げ, so that only this string is replaced with {...} while た is left as it is.
Another example:
が増える \t 値段がが増える
becomes
が増える \t 値段が{...}
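One way to read case 3 is "try the whole vocab first, then shorter and shorter prefixes of it, and replace the first one that occurs in the sentence". The sketch below follows that reading. Restricting the search to prefixes (rather than any common substring) is an assumption based on the 上げる / 上げた example, and the function name is hypothetical.

```python
def cloze_longest_prefix(vocab, sentence, placeholder="{...}"):
    """Replace the longest prefix of vocab that occurs in sentence."""
    # Assumption: conjugation only changes the END of the vocab, so the
    # shared part is always a prefix (上げる -> 上げ).
    for end in range(len(vocab), 0, -1):
        prefix = vocab[:end]
        if prefix in sentence:
            return sentence.replace(prefix, placeholder)
    # Nothing in common: leave the sentence untouched.
    return sentence


print(cloze_longest_prefix("上げる", "彼は荷物をあみだなに上げた。"))
print(cloze_longest_prefix("が増える", "値段がが増える"))
```

Because the full vocab is tried first, cases 1 and 2 still get exact matches. The risk is with short prefixes: a one-character prefix can match unrelated kana elsewhere in the sentence, which is the same concern raised for pure kana above.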
More TL;DR
I'm using Anki.
I could use Excel or Notepad++, but I don't think they can replace text based on placeholders.
My goal here is to create pseudo-cloze sentences that I can use as hints, hidden in a hint field, for ridiculously hard synonyms or homonyms (I have an auditory card).
I know I'm missing a fourth case, i.e., pure kana with the possibility of the sentence having changed tense, and hence spelling. Well, that'd be hard to code, so I'd rather do it manually so as not to mess up other kana in the sentence.
Update
I forgot to mention that the text is contained in a .txt file in this format:
全部 \t それ、全部ください。
ああ \t ああうるさい人は苦手です。
あいづち \t 彼の話に私はあいづちを打ったの。
上げる \t 彼は荷物をあみだなに上げた。
There are 7000 lines of these, so it has to check for replacements on every line.
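For a file of that size, processing it line by line is enough. The sketch below streams the lines through a generator and only handles the exact-match cases (1 and 2). The function names are hypothetical, and it assumes a UTF-8 file with one real tab per line.

```python
def cloze_lines(lines, placeholder="{...}"):
    """Yield each tab-separated line with column 1 blanked out of column 2."""
    for line in lines:
        vocab, sep, sentence = line.rstrip("\n").partition("\t")
        if sep and vocab:
            sentence = sentence.replace(vocab, placeholder)
        # Lines without a tab pass through unchanged.
        yield vocab + sep + sentence + "\n"


def cloze_file(src_path, dst_path):
    """Read src_path line by line and write the clozed lines to dst_path."""
    # Assumption: both files are UTF-8 encoded.
    with open(src_path, "r", encoding="utf-8") as fin, \
         open(dst_path, "w", encoding="utf-8") as fout:
        fout.writelines(cloze_lines(fin))
```

Calling cloze_file('file.txt', 'output.txt') would then process the whole file in one pass; the two file names here are only placeholders.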
The code works, thanks, but there is a minor bug: for sentences including non-full replacements, it creates broken characters.
上げたxxxx 彼は荷物をあみだなに上げあ。
abcd abcd123
86876 xx86876h897
全部 それ、全部ください
ああ ああうるさい人は苦手です。
上げたxxxx 彼は荷物をあみだなに上げあ。
務める ああうるさい人は苦手で務めす。
務める ああうるさい務めす人は苦手で。
turns into:
I edited James' code a bit for testing purposes (I'm using the edited version to check what kind of strings throw off the code). So far I've discovered that spaces in the vocabulary cause trouble.
This code prints the original line below the parsed line.
Change the line:
fout.write(output)
to this:
fout.write(output+str(line)+'\n')
This regex should deal with the cases you are looking for (including matching the longest possible pattern in the first column):
^(\S+)(\S*?)\s+?(\S*?(\1)\S*?)$
You can go on to use the match groups to make the specific replacement you are looking for. Here is an example solution in Python:

import re

# Group 1: longest prefix of column 1 that also occurs in column 2.
# Group 2: the rest of column 1. Group 3: the whole of column 2.
regex = re.compile(r'^(\S+)(\S*?)\s+?(\S*?(\1)\S*?)$')

with open('output.txt', 'w', encoding='utf-8') as fout:
    with open('file.txt', 'r', encoding='utf-8') as fin:
        for line in fin:
            match = regex.match(line)
            if match:
                # Blank out the shared part inside the sentence.
                hint = match.group(3).replace(match.group(1), '{...}')
                output = '{0}\t{1}\n'.format(match.group(1) + match.group(2), hint)
                fout.write(output)
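To try the pattern on the sample lines from the question without touching any files, the same logic can be wrapped in a function. This is only a sketch: it assumes the pattern's character classes are \S (non-whitespace) with \s+? matching the tab, and the names PATTERN and cloze_regex are made up here.

```python
import re

# Assumption: \S = non-whitespace, so neither column may contain spaces.
PATTERN = re.compile(r'^(\S+)(\S*?)\s+?(\S*?(\1)\S*?)$')


def cloze_regex(line, placeholder="{...}"):
    """Return the clozed line, or None if the pattern does not match."""
    match = PATTERN.match(line)
    if not match:
        return None
    # group(1) is the longest prefix of column 1 found in column 2.
    hint = match.group(3).replace(match.group(1), placeholder)
    return match.group(1) + match.group(2) + "\t" + hint


samples = [
    "abcd\t19475abcd_97jdhgbl",
    "全部\tそれ、全部ください。",
    "上げる\t彼は荷物をあみだなに上げた。",
    "が増える\t値段がが増える",
]
for sample in samples:
    print(cloze_regex(sample))
```

A line that the pattern cannot match (for example one with no tab, or with spaces inside a column) returns None here, which corresponds to the line being skipped in the file-based version.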