Special characters in reader texts get parsed as 々 in known words

quexten · October 17, 2022, 11:49am

Hi, I noticed that the “known words” percentage for a text seemed a bit off and did a little digging. When looking at the “wordlist” returned by the internal API, there seem to be parsing issues related to special characters. For example, for the text "「 !? 大学…… 名前！」 " (it has no meaning; it is just crafted as a demonstration), the word list returned by the API is:

"々",
"!?",
"大学々々",
"名前々々"

even though “々” never appears in the text. The text is marked as 25% known, even though it contains no unknown words (大学 and 名前 in particular are known). In my actual texts, many words in the wordlist have multiple “々” characters appended for no apparent reason other than the presence of special characters, which makes those words count as unknown. Could these special characters be excluded from the known words parsing? Otherwise I could of course pre-process the texts before uploading them to Kitsun so that they contain no special characters, but that would also hurt readability.
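
For anyone who wants to try the pre-processing workaround in the meantime, here is a minimal sketch in Python. It assumes that "special characters" means anything Unicode classifies as punctuation or a symbol. The function name `strip_special_characters` is made up for this example and is not part of Kitsun or its API. Each removed character is replaced by a space so that neighbouring words do not get glued together.

```python
import unicodedata


def strip_special_characters(text: str) -> str:
    """Replace punctuation (P*) and symbol (S*) characters with spaces.

    Hypothetical helper for illustration only; not part of Kitsun.
    """
    cleaned = []
    for ch in text:
        # The first letter of the Unicode general category is the major class:
        # "P" = punctuation, "S" = symbol, "L" = letter, and so on.
        if unicodedata.category(ch)[0] in ("P", "S"):
            cleaned.append(" ")
        else:
            cleaned.append(ch)
    # Collapse the runs of whitespace left behind by the removed characters.
    return " ".join("".join(cleaned).split())


sample = "「 !? 大学…… 名前！」 "
print(strip_special_characters(sample))  # prints: 大学 名前
```

A real “々” in a word such as 人々 is kept, because Unicode classifies it as a letter rather than punctuation. As noted above, the cleaned text is harder to read, so this is only a stopgap.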

Neicudi · October 18, 2022, 2:52pm

Heya!

I just checked and can also see the wrong behavior. Something weird is definitely going wrong with the text parser at the moment. It does not seem to split symbols from actual words anymore.

Going to take a look!

Neicudi · October 18, 2022, 3:47pm

Found the issue, just deployed a fix!

You might need to resave the text for it to work properly again after the fix is live (make sure to refresh first in case you still had a Kitsun tab open).

quexten · October 19, 2022, 12:03am

Thanks for the quick fix! After clicking edit and save, it seems to be correct now.