Reader tool parsing issue

aendur · February 20, 2021, 10:08pm

When pasting text into the reader, it sometimes does not parse expressions correctly when they are composed of smaller sub-expressions. This seems to be especially the case in the following situations:

When the expression has okurigana followed by more kanji, such as

肌で感じる gets parsed as 肌, the particle で, and 感じる, whereas it should be parsed as a single expression

When the expression is fully written in hiragana

風鈴 is parsed correctly, but ふうりん is parsed into ふう (seal) and りん (one-hundredth)
はだでかんじる gets parsed as the particle は, だで, かんじる

Also, some expressions with okurigana are parsed correctly when written with kanji, but still fail when written with hiragana, such as:

食べ物 is parsed correctly, but たべもの is not parsed at all
入り口 is parsed correctly, but いりぐち is parsed as いり and ぐち

Overall, it looks like the parser is trying to prioritize particles when it deals with hiragana. It also appears the parser always prioritizes short words/expressions over long ones. Since it may not always be possible to determine whether the desired expression is the short or the long one without context, I believe it might be useful to have an option to either suggest corrections for the parser or to display other possible results.
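To illustrate what "display other possible results" could look like, here is a minimal sketch that lists every way a kana string can be split into known words. It is not how the reader tool actually works: `TOY_DICT` and `all_segmentations` are hypothetical names, and the dictionary is a tiny stand-in for a real lexicon.

```python
from functools import lru_cache

# Hypothetical toy dictionary; a real parser would use a full lexicon.
TOY_DICT = frozenset({
    "は", "はだ", "だで", "で", "かんじる", "はだでかんじる",
    "ふう", "りん", "ふうりん",
})

def all_segmentations(text, dictionary=TOY_DICT):
    """Return every way to split `text` into dictionary entries."""

    @lru_cache(maxsize=None)
    def split_from(start):
        # One valid way to split the empty remainder: no tokens at all.
        if start == len(text):
            return [[]]
        results = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in dictionary:
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    # Fewest tokens first, so the longest expressions are listed at the top.
    return sorted(split_from(0), key=len)

for candidate in all_segmentations("はだでかんじる"):
    print(" | ".join(candidate))
```

With this toy dictionary the single-token reading はだでかんじる is listed first, followed by は | だで | かんじる and はだ | で | かんじる, so a user could pick the reading they meant instead of being stuck with the parser's first choice.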

Neicudi · February 21, 2021, 12:26pm

Hey!

Thank you for the feedback!

Parsing a language such as Japanese is very difficult to do well because a single word can be written in so many different ways, and on top of that the parser has to break the sentence up into words to begin with. Keeping that in mind, it’s not surprising that it has some issues.

I’m not sure how I could make it recognise 肌で感じる as an expression as it’s parsed as “noun” “particle” “verb” (which usually makes sense) and putting those together as one “word” might work in this case, but probably not in other cases… I’ll have to do some tests with it to see if anything can be done.

The hiragana case is another very difficult one for the parser. ふう can mean a lot of different things in Japanese. The same goes for りん. In the Kanji version it’s easy to put them together as it narrows down the meaning, whereas with hiragana, it could basically fit for 20 different variants.

Same reason as above

As you mentioned, sometimes it’s extremely difficult for the parser to figure out what the correct tokenization of the sentence is. Using Kanji forms usually works better, but in real Japanese texts you’ll often find hiragana versions of words regardless of whether a Kanji form exists.

One of the initial plans for the reading tool was to be able to send in suggestions or mark misparses as wrong, and I’d still like to implement that feature sometime. At the same time, I do wonder how to accurately deal with these kinds of situations though. Would you happen to have any ideas for that? Or if anyone else has some ideas I’d also love to hear them of course!

Thanks again for the detailed feedback! I hope we can figure out a solution to this problem soon.

aendur · February 26, 2021, 2:29am

Yes, to be honest, I believe the ability for the user to manually select and tokenize some segments of the text might be a good compromise, since there can be a lot of ambiguity involved even if we take context into consideration. It would also be very useful if the parser could then present us with alternatives for the tokens, particularly when dealing with kana-only text.

Aside from that, perhaps this is already being done, but maybe including or increasing the length of lookaheads might yield better matches when creating the tokens?

Neicudi · February 26, 2021, 12:45pm

The thing is that the text gets parsed again every time you open the page, so any segmentation the user makes would have to be reflected in the code of the parser. The tokens themselves are not being saved to Kitsun (too much data). So it would work for making one-time adjustments in order to get the vocabulary/word that you want and generate a card from it, but it wouldn’t be lasting until we solve the issue in the actual parsing code.
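One possible middle ground, sketched here purely as an assumption and not as anything Kitsun does, would be to store only the handful of expressions a user has marked (a few short strings rather than all tokens) and re-apply them after each parse. `apply_overrides` and the `overrides` set are hypothetical names for illustration.

```python
def apply_overrides(tokens, overrides):
    """Merge runs of consecutive tokens whose concatenation is a user-marked expression.

    `tokens` is the parser's output for the page; `overrides` is a set of
    expressions the user asked to keep together (assumed to be stored per user).
    """
    merged = []
    i = 0
    while i < len(tokens):
        best_end = i + 1  # default: keep the token as the parser produced it
        joined = ""
        for j in range(i, len(tokens)):
            joined += tokens[j]
            if joined in overrides:
                best_end = j + 1  # keep extending to the longest marked expression
        merged.append("".join(tokens[i:best_end]))
        i = best_end
    return merged

print(apply_overrides(["肌", "で", "感じる", "。"], {"肌で感じる"}))
# ['肌で感じる', '。']
```

This only helps when the marked expression lines up with the parser's token boundaries; it cannot repair a case where the parser split a word in the wrong place.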

Lookahead is kinda already being done in the sense that the parser always looks forward until it doesn’t match anything that could create a combination. In this case I might be able to adjust it to look even further ahead though, so it’s not a bad idea! I just have to be very wary of unintentionally tagging things together that aren’t supposed to be together.
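The trade-off can be seen in a small greedy longest-match sketch. This is a generic technique with a made-up dictionary, not the reader's actual code; `longest_match_tokenize` and `max_lookahead` are hypothetical names.

```python
def longest_match_tokenize(text, dictionary, max_lookahead=8):
    """Greedy tokenizer: at each position take the longest dictionary entry
    that fits within `max_lookahead` characters."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = text[pos]  # fall back to a single character if nothing matches
        limit = min(len(text), pos + max_lookahead)
        # Try the longest candidate first and shrink until something matches.
        for end in range(limit, pos, -1):
            if text[pos:end] in dictionary:
                match = text[pos:end]
                break
        tokens.append(match)
        pos += len(match)
    return tokens

words = {"いり", "ぐち", "いりぐち"}
print(longest_match_tokenize("いりぐち", words, max_lookahead=2))  # ['いり', 'ぐち']
print(longest_match_tokenize("いりぐち", words, max_lookahead=4))  # ['いりぐち']

# A longer match is not always the right one:
risky = {"これ", "は", "だめ", "はだ"}
print(longest_match_tokenize("これはだめ", risky))  # ['これ', 'はだ', 'め']
```

A wider lookahead fixes いりぐち, but the last call shows the risk mentioned above: はだ (skin) is greedily glued together even though the intended reading is は followed by だめ.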

Thanks for the suggestions by the way, appreciate it!