Kanji Data

#26 by tester
2021-04-10 at 19:40
< report >Okay, here you are: JPScriptScanner.

It's written in Python, and all installation steps are in README.MD.
I have also written a simple CLI for it.
#27 by tomtheerogeman
2021-04-11 at 00:11
< report >I followed all your instructions and tested the program, and it appears to work perfectly for the kanji list. But the word list is missing. This was the command-line output right before it closed, in case you need it:

Введите папку скриптов (по умолчанию in_files)/
Enter the script folder (default: in_files):
Нет такой папки. Используем папку по-умолчанию.../
There is no such folder. Using default folder...
Введите название кодировки (shift-jis, cp932, utf-8, utf-16 и так далее.../
Enter the encoding name (shift-jis, cp932, utf-8, utf-16 etc...): utf-8
Введите основу выходного названия статистики/
Enter the base of output data name: itsusorawords
Введите название файла (UTF-8) списка слов для исключения/
Enter the bad words list file (UTF-8) name: non-spoiler-list.txt
Введите название файла (UTF-8) списка кандзи для исключения/
Enter the bad kanji (UTF-8) list file name:
Ошибка загрузки файла кандзи для исключения/
Error of loading bad kanji list.
Текущий прогресс по кандзи/Current kanji progress: 50.0%.

Is it related to MeCab or something? I installed both MeCab and the recommended unidic-lite using pip as in link, but it gave me version 0.996. For the %PATH% (the environment variables list, I assume) I have C:\Program Files (x86)\MeCab\bin set as both a user variable and a system variable (I wasn't sure which one, so I did both).
Last modified on 2021-04-11 at 00:14
#28 by tomtheerogeman
2021-04-11 at 04:46
< report >Sorry, I was confused: I installed 0.996 of MeCab itself from link, but mecab-python3 is a separate thing in link, and I have 1.0.3 of that. Are the versions supposed to match? If so, I would need installation instructions for mecab-python3 version 0.996.x, because they updated the pip repository to 1.0.3.
#29 by tester
2021-04-11 at 06:58
< report >Well, I did install MeCab 0.996 and it works correctly.
Hm... To figure out the problem, I'd need to see the error message.

Try launching main.py with Python 3's IDLE (included in the standard Python package), then post the error message here.
Last modified on 2021-04-11 at 09:14
#30 by tomtheerogeman
2021-04-11 at 07:24
< report >Using IDLE I got this message after kanji progress reached 100%:

Failed initializing MeCab. Please see the README for possible solutions:

https://github.com/SamuraiT/mecab-python3#common-issues

If you are still having trouble, please file an issue here, and include the
ERROR DETAILS below:

https://github.com/SamuraiT/mecab-python3/issues

issueを英語で書く必要はありません。

------------------- ERROR DETAILS ------------------------
arguments: -Ochasen
error message: [!tmp.empty()] unknown format type [chasen]
----------------------------------------------------------
Traceback (most recent call last):
File "C:\エロゲ\jpscriptscanner\main.py", line 80, in <module>
scannow.word_data_to_file('{}_WORDS.txt'.format(name))
File "C:\エロゲ\jpscriptscanner\JP_script_scanner.py", line 53, in word_data_to_file
self._add_word_data_from_files(toer)
File "C:\エロゲ\jpscriptscanner\JP_script_scanner.py", line 284, in _add_word_data_from_files
count = self._add_word_data_from_file(i, symbols, freq, count)
File "C:\エロゲ\jpscriptscanner\JP_script_scanner.py", line 321, in _add_word_data_from_file
tagger = MeCab.Tagger("-Ochasen")
File "C:\Users\Richard\AppData\Local\Programs\Python\Python39\lib\site-packages\MeCab\__init__.py", line 124, in __init__
super(Tagger, self).__init__(args)
RuntimeError
>>>

Edit: I should note that I already installed the Windows Redistributable and ran pip install unidic-lite, as the readme suggested.
Last modified on 2021-04-11 at 07:31
#31 by tester
2021-04-11 at 07:47
< report >Okay, I understand the problem (it's quite strange that I don't have it myself). There are a number of ways to resolve it, but I need to test them first.
Last modified on 2021-04-11 at 08:00
#32 by tester
2021-04-11 at 09:04
< report >I rewrote the program to use the embedded dump mode instead of ChaSen mode, which should also work with up-to-date unidic-lite.
New version is also here.

Try this one.

===
It may run a bit slower now, but word tokenization accuracy should improve.
Last modified on 2021-04-11 at 09:06
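For anyone hitting the same `[!tmp.empty()] unknown format type [chasen]` error: unidic-lite ships no ChaSen output definition in its dicrc, so constructing `MeCab.Tagger("-Ochasen")` raises RuntimeError, while the default tagger works. A minimal sketch of reading the default format (the `surfaces` helper and the sample string are illustrative, not the tool's actual code; the sample imitates abbreviated unidic-style output):

```python
# With unidic-lite, MeCab.Tagger("-Ochasen") fails because no ChaSen format
# is defined in the dictionary; the default tagger (dump-style
# "surface\tfeatures" lines) works instead:
#
#   import MeCab
#   tagger = MeCab.Tagger()        # default format, no -Ochasen
#   raw = tagger.parse("猫が好き")

def surfaces(raw: str) -> list[str]:
    """Extract surface forms from default-format MeCab output."""
    words = []
    for line in raw.splitlines():
        if not line or line == "EOS":
            continue  # skip blanks and the end-of-sentence marker
        words.append(line.split("\t", 1)[0])
    return words

# Simulated tagger output for 猫が好き (unidic-style, abbreviated):
raw = ("猫\t名詞,普通名詞,一般\n"
       "が\t助詞,格助詞\n"
       "好き\t形状詞,一般\n"
       "EOS\n")
print(surfaces(raw))  # ['猫', 'が', '好き']
```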
#33 by tomtheerogeman
2021-04-11 at 15:40
< report >Sadly it still didn't work, but I got a different error this time:

Traceback (most recent call last):
File "C:\エロゲ\jpscriptscanner\main.py", line 96, in <module>
scannow.word_data_to_file('{}_WORDS.txt'.format(name))
File "C:\エロゲ\jpscriptscanner\JP_script_scanner.py", line 53, in word_data_to_file
self._add_word_data_from_files(toer)
File "C:\エロゲ\jpscriptscanner\JP_script_scanner.py", line 284, in _add_word_data_from_files
count = self._add_word_data_from_file(i, symbols, freq, count)
File "C:\エロゲ\jpscriptscanner\JP_script_scanner.py", line 326, in _add_word_data_from_file
if (not (self._is_japanese(new_line))):
File "C:\エロゲ\jpscriptscanner\JP_script_scanner.py", line 363, in _is_japanese
if (len(liner) == len(liner.encode('cp932'))):
UnicodeEncodeError: 'cp932' codec can't encode character '\xab' in position 5: illegal multibyte sequence

I notice cp932 is an option when selecting an encoding, but Notepad shows UTF-8 for that file. This is one of the VN scripts found on wareya's old website; I didn't extract it myself. Probably the same place where you got the script for muramasa.

I tried entering cp932 instead of UTF-8 but that didn't work either.

Now trying a different game (fortissimo EX instead of itsusora); the error is slightly different, but it's the same kind of encoding failure:

Traceback (most recent call last):
File "C:\エロゲ\jpscriptscanner\main.py", line 96, in <module>
scannow.word_data_to_file('{}_WORDS.txt'.format(name))
File "C:\エロゲ\jpscriptscanner\JP_script_scanner.py", line 53, in word_data_to_file
self._add_word_data_from_files(toer)
File "C:\エロゲ\jpscriptscanner\JP_script_scanner.py", line 284, in _add_word_data_from_files
count = self._add_word_data_from_file(i, symbols, freq, count)
File "C:\エロゲ\jpscriptscanner\JP_script_scanner.py", line 326, in _add_word_data_from_file
if (not (self._is_japanese(new_line))):
File "C:\エロゲ\jpscriptscanner\JP_script_scanner.py", line 363, in _is_japanese
if (len(liner) == len(liner.encode('cp932'))):
UnicodeEncodeError: 'cp932' codec can't encode character '\ufffd' in position 82: illegal multibyte sequence
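For reference, a Japanese-detection check that round-trips text through cp932, like the `len(liner) == len(liner.encode('cp932'))` test in the traceback, raises UnicodeEncodeError on any character cp932 cannot represent, such as '\xab' («) or '\ufffd' (the replacement character). A hedged sketch of a crash-proof alternative that tests Unicode blocks directly (the function name and exact ranges are illustrative, not the tool's actual code):

```python
import re

# Match hiragana, katakana, or CJK unified ideographs; unlike an
# encode-to-cp932 round trip, searching Unicode ranges can never raise
# UnicodeEncodeError on stray characters like '\xab' or '\ufffd'.
_JAPANESE = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]")

def is_japanese(line: str) -> bool:
    return bool(_JAPANESE.search(line))

print(is_japanese("「昨日«引用»」"))           # True: kanji present, « » do not crash it
print(is_japanese("hello \xab world \ufffd"))  # False: no Japanese characters at all
```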
#34 by tester
2021-04-11 at 16:09
< report >
Sadly it still didn't work, but I got a different error this time:
Well... these kinds of errors show up when you feed the program compiled scripts instead of plain-text scripts, or when you choose the wrong encoding.
> No, that's not it.
It could still be a bug. I'll need to test this, but for that I'll need the scripts.
- No, I don't need the scripts anymore. And it's definitely a bug.

You may try to send me the scripts you gave to the program; I could see the exact problem and fix it.
> No, I have already figured it out. I'll fix it soon.

Probably the same place where you got the script for muramasa.
No, I extracted Muramasa's script myself. Likewise Boku to Iu Mono's script (well, I was the one who forced the original NScripter to work with the Russian translation in that game).
Last modified on 2021-04-11 at 16:28
#35 by tester
2021-04-11 at 16:26
< report >Here you are: yet another new version.
I modified the Japanese symbol checking; the program's running time shouldn't be affected.
Last modified on 2021-04-11 at 16:26
#36 by tomtheerogeman
2021-04-11 at 17:03
< report >IDLE showed no error messages, but this was the resulting WORDS.txt :(

=== Уникальные слова/Unique words (22):

動詞%F4@2; 形容詞%F1; 名詞%F2@1; 形容詞%F6@0; "形容詞%F2@-1; 動詞%F1; "名詞%F1; 動詞%F2@0; 形容詞%F2@0; 形容詞%F4@-2; "形容詞%F4@-1; 形容詞%F2@1; 動詞%F4@1; "名詞%F2@1; "動詞%F1; "動詞%F2@1; 動詞%F2@1; 名詞%F1; 一般; 形容詞%F2@3; 形容詞%F2@-1; "動詞%F2@0.

=== Слова по частоте/Words frequency:

1. "動詞%F2@0 — 27264.
2. 名詞%F1 — 26048.
3. "動詞%F2@1 — 14664.
4. "動詞%F1 — 13344.
5. 動詞%F1 — 7808.
6. 形容詞%F2@-1 — 7444.
7. 形容詞%F4@-2 — 5489.
8. 動詞%F2@0 — 2683.
9. 形容詞%F1 — 1533.
10. 形容詞%F2@1 — 976.
11. 名詞%F2@1 — 883.
12. "形容詞%F2@-1 — 842.
13. 動詞%F2@1 — 402.
14. 動詞%F4@1 — 336.
15. 形容詞%F6@0 — 203.
16. "名詞%F2@1 — 193.
17. 動詞%F4@2 — 51.
18. "形容詞%F4@-1 — 18.
19. 形容詞%F2@3 — 10.
20. 形容詞%F2@0 — 7.
21. "名詞%F1 — 4.
22. 一般 — 1.
#37 by tester
2021-04-11 at 17:11
< report >
IDLE showed no error messages, but this was the resulting WORDS.txt :(

It's definitely not okay. I think it's a problem with how the program handles Unicode, because in my tests (with shift-jis/cp932) there was no such problem. Maybe I'll need to rewrite some functions.
Send me those scripts to test.
Last modified on 2021-04-11 at 17:14
#38 by tomtheerogeman
2021-04-11 at 17:31
< report >link

It's itsusora.txt, fortissimoexs.txt and harumade kururu.txt that I tried.
#39 by tester
2021-04-11 at 19:02
< report >
It's itsusora.txt, fortissimoexs.txt and harumade kururu.txt that I tried.

I have no such problem. Neither with Itsusora, nor with FortissimoEXS, nor even with Harumade.
And dump mode should be an embedded function... it just ought to work correctly, if I'm not mistaken...

I get 12406 unique words in Itsusora (with the non-spoiler-list, if I'm not mistaken), 14385 unique words in FortissimoEXS (without that list) and 8828 unique words in Harumade (also without the list).

Write here the exact steps by which you got only 22 unique words (all your input) and the exact script you got them from (post the contents of your input folder as it was when you ran the program and got that result).
Last modified on 2021-04-11 at 19:16
#40 by tomtheerogeman
2021-04-11 at 19:25
< report >The 22 words problem was from itsusora.txt. The word count does not change regardless of whether or not I use the non-spoiler-list.

IDLE input is as follows:

Введите папку скриптов (по умолчанию in_files)/
Enter the script folder (default: in_files): in_files
Введите название кодировки (shift-jis, cp932, utf-8, utf-16 и так далее.../
Enter the encoding name (shift-jis, cp932, utf-8, utf-16 etc...): utf-8
Введите основу выходного названия статистики/
Enter the base of output data name: results
Введите название файла (UTF-8) списка слов для исключения/
Enter the bad words list file (UTF-8) name:
Введите название файла (UTF-8) списка кандзи для исключения/
Enter the bad kanji (UTF-8) list file name:
Ошибка загрузки файла слов для исключения/
Error of loading bad words list.
Ошибка загрузки файла кандзи для исключения/
Error of loading bad kanji list.
Текущий прогресс по кандзи/Current kanji progress: 50.0%.
Текущий прогресс по кандзи/Current kanji progress: 100.0%.
2465
Текущий прогресс по словам/Current words progress: 50.0%.
Текущий прогресс по словам/Current words progress: 100.0%.
22
>>>
#41 by tester
2021-04-11 at 19:34
< report >
The 22 words problem was from itsusora.txt
Hmmm... the progress wouldn't show 50% and then 100% if there were only one file in the in_files folder; it would be 100% from the beginning. That indicator was created to show progress while working with multiple files. Are there other files in your in_files folder?

Well, I need to think about why you have that problem...
Maybe you wouldn't have it with PyCharm, at least...
Last modified on 2021-04-11 at 19:42
#42 by tomtheerogeman
2021-04-11 at 19:44
< report >The only other file in that folder is .gitkeep; it came with the .zip I downloaded from GitHub. I tried deleting it and running the program again, but it's still 22 words:

Введите папку скриптов (по умолчанию in_files)/
Enter the script folder (default: in_files): in_files
Введите название кодировки (shift-jis, cp932, utf-8, utf-16 и так далее.../
Enter the encoding name (shift-jis, cp932, utf-8, utf-16 etc...): utf-8
Введите основу выходного названия статистики/
Enter the base of output data name: result
Введите название файла (UTF-8) списка слов для исключения/
Enter the bad words list file (UTF-8) name:
Введите название файла (UTF-8) списка кандзи для исключения/
Enter the bad kanji (UTF-8) list file name:
Ошибка загрузки файла слов для исключения/
Error of loading bad words list.
Ошибка загрузки файла кандзи для исключения/
Error of loading bad kanji list.
Текущий прогресс по кандзи/Current kanji progress: 100.0%.
2465
Текущий прогресс по словам/Current words progress: 100.0%.
22
>>>

Edit: installing PyCharm Community Edition now...
Last modified on 2021-04-11 at 19:49
#43 by tomtheerogeman
2021-04-11 at 19:58
< report >Same problem after installing and opening PyCharm Community Edition; I don't know what plugins you're using with it, though.
#44 by tester
2021-04-11 at 19:59
< report >The same? The same?! How?!

I need to think and test more...
(I only used mecab-python3, kanji-lists, unidic-lite, pip and setuptools in my virtual environment).

=== Edit:
Oh, after testing on another computer, I may have a slight idea...
Last modified on 2021-04-11 at 20:15
#45 by tester
2021-04-11 at 20:33
< report >Okay, you can delete PyCharm now if you don't need it.

I finally hunted down that wild bug. I also made a few modifications (you'll see them in the result file) to improve the output even more.
Now try this version.

I hope this one finally resolves all the problems.
Last modified on 2021-04-11 at 20:35
#46 by tomtheerogeman
2021-04-11 at 23:41
< report >Finally, it appears to be working! I giggled like a schoolgirl, feeling like I won't need to look things up anymore while I'm reading. Thank you so much for your hard work.

I don't know if you'll even want to look into this after all the trouble you went through, but I found something strange. The kanji count matches wareya's website, but the word count is slightly off compared to yours. You got 12406 unique words in itsusora without the non spoiler list, but I got only 11383. Was it the modifications you mentioned? I guess I'll just have to try it one day, after I finish the rest of my Anki cards, which will take maybe 2-3 months. Now I need to look into exporting all my vocabulary to plaintext for the blacklist...
Last modified on 2021-04-11 at 23:42
#47 by tester
2021-04-12 at 09:08
< report >That number was from before the new modification. Now I get 11383 unique words in Itsusora (without that list).

Previously the tool did not try to convert kana-spelled verbs (and adjectives, if I remember correctly) to kanji+okurigana, so a word was counted more than once when both its kana variant and its kanji+okurigana variant were present. Now there should be no such problem: all verbs (and maybe adjectives) are converted to kanji+okurigana form.

That's why the word list contains "1. 為る — 6695." while the kanji list contains "60. 為 — 470. ДЗЁ:Ё:/JOUYOU. N1.".

That system has one shortcoming when it comes to verbs that have two or more kanji+okurigana forms, like aru, but I think it's not so significant.
Last modified on 2021-04-12 at 09:18
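The kana-to-kanji+okurigana merging described above falls out naturally when each token is counted by its dictionary lemma rather than its surface form. A hedged sketch, assuming unidic-style features where the lemma is the 8th comma-separated field (the helper and the sample line are illustrative, not the tool's actual code):

```python
def lemma(token_line: str) -> str:
    """Return the dictionary lemma of one default-format MeCab token line."""
    surface, features = token_line.split("\t", 1)
    fields = features.split(",")
    # unidic puts the lemma at index 7; fall back to the surface form
    return fields[7] if len(fields) > 7 and fields[7] else surface

# Simulated unidic output for して, a conjugated form of する (lemma 為る):
line = "し\t動詞,非自立可能,*,*,サ行変格,連用形-一般,スル,為る,し"
print(lemma(line))  # 為る
```

Counting by lemma is why して and 為る collapse into one entry; a verb like ある with several possible kanji spellings still collapses to whichever single lemma the dictionary chose, which is the shortcoming mentioned above.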
