Kanji Data

#1 by tester
2021-04-02 at 18:24
Some kanji data, obtained via one of my test tools.
Why is it here? Dunno. Because it may be useful for someone?
(For some of the full editions).

=== Unique Kanji (3085):

畑條景雀醒傲迎蚩特議困預枇揉堪熾盟瓜醜均紹清零裟穣矢齬臍願途違臆詣法筋沿念整僅簾八免左眠借軸嘲防販川仙鮮詭襤賭平飾淆寥蛉緊参飯水燎鯉肖坩墓囃瓦約塗来褸伸科暫泳趾染頚痍躓漁悼茣鑿扮脳父埋良恩蜥英恃岩夕乞銘曰机符侠師手匂鋩去兇嗅菜洪腰慙螂勉窮池荷置撹空刃巨舫伊祝沢征滑状夷民遙累魚袋則瓶稿条暁頷貼解復票脹鮒斯端縷次滓鰆急侍誅張閊善怩浦某櫻宿除朽従燈糖尖目恨顛削使果熊希叱嵩務躾貝肛錬癖到南挟歪題躱育倖輩躇顕塩門誉姓消靭憾痕枯弁抱暦炎掻叙進板螺咆虚偲囁衾帯字被吾挑賃伺爵蛇掘鯵露磨郡鱗堝析秘懊遭炬浅訶哀啖掃吸醇藩旬徘屍摺苦幹慨昇尿叔記涙嚼印員撒刊哄噛由囲偽裂宥較喇時逝崎巣全捨物催憑任宗討惚蒙縋有覿蛟麦誤枢縫幟休球樹呟呑落盾迄垢昏呵耶楔磁懃俄覧媚見食種利神撓許鬩琲紋哂瘤恐檻基隠局綾陣像摩戯盗失背渦砲悧石衰聘誇谺乾莫烙枕赤友麗浪挨燕僚靴義過峰頸練熨歴右角椎殻鎮錚熄磯瀑棘昆荘医巧匠忍体海辿糾仇雁収班彩巷猛療内燻唸勤熟泡膳抹燵拐材陥碗嘘弧酷冠保闖作呼憎鵡藪恣竦馬予踊大動曇沌鞠鏝祈陀完麻底戈律尉箒濡御膜串近滞姦沈他毘盤喉麒赫篠遣加窒眸臥軒鵬仕溶眼僻幅磔嗚欄戻欲痙準吠騒堡潜撥凛挽柑静勘寂梯礼莞峡突擂副察晦飢気濘杖冶摂油魔凸供注街泰爺筑侵群攫乍泉吏攪藉含満丈糺婚躙修究危剽裡尼銃喝住鋼京数涸即猥冒咳嫌匙季建豆疚故琢銭寛拾項骨訓期滔哉虎藻痩装新遺築剌壇益甥朶難処賑錯承房久逆電呻瀬孟旨簀励縊携蜂唾洸稷警昼随刹実憬繊系葱畜審捻獄其斂奢札乙詩嫋轟非獨魄巻層劾括惰府半稚兆爬至理雅喊驍遠枝撫徴猟媛浄管旺十紺蛸以鼓剥箪鬨引燼豹娶並笛掣奇湾攻雷始舗社月琳鉾殲昨捩弥蹲何企耕肘働鍍颯屯避蝕柩庚切嗣片雪雑釜剣盥康竹埒琴妻唖酎虫緒檀綿駿娼配乱鋤悔彫薙敷補噴拓甦誌光売絵伯唐差蕭乏吝村蛙蠱須桃茨抗七晰懸超迸根漏沃模漸喜換艦薪母鷹纏鋳腐隷敢里卓闊虐就斜臓云屹或懣鴨穴謀擦然隆芽揆赴藁活釣衒帳投牡広裸毒商惨鄙煩冴羅藤縒指槽繋変必傑針黒沸閃冑塔償県膨強珠町凶騙翌嘯鉱担燭粥櫃呈啓極禿述塞耽孵倚鍵訝扈邪開機遡挿輪華寝績凍昧蝙戮競嵌蛾飲隣榴惹短餞仏愚敲米夜湊庁渇響選鳥袈鐚懦魂戟柔駆漿毫功床亡戒味閻捺已陰柱倦億磐跋課申稟貨脊二税揶惜嫁術棺健尹氏倣症疇偶自救包這長睾曲狭階慮誓瞳慌深博仮恵慰鐸譲操堵狼和伐諭秒桂略夏懲蝋揺佞彰伴綴株沮容鵜髭残尊当綜簡敬褪範葉映珍剤祟日棄悸猫濃茎汗塀蜜堤版霞頭徒山政罰姫奪代属禽庭羞殴鶴遁俊鹿瞼蕾五酬寸迭没聖世喰肴蜴捕捜旛灼優履東封怖猜謔轢毎昂序布各弘蓮柏薇得入暖齎候詈接懺遜嘱破湯儂颶創繁膏飛未度舟朋斉蔓丹黄舷翻洩堕狡造豪柳薩蜘弱昔犠皓丁謂焼晴勒択禅芸洲澄刻笠心集国爽俗鳴決際花翳策禍嘉闇評谷跨紡軌紐給瞞流歌鞄劒忙厚調葦謙噎腿頼高慎扶悦洛分勢辰絨談振茸鋒薬震意憐招扼団乗訪樽頽暇沁貢猪卒央検唇糎斧踏署双碍鉄脂賄窃別嬲瞭羊燐堯拡淡飴押瑕柿鎌六穏汁緻喫発屑雰薔崇助苛計往扱勇勃維結仰緯緘怨閥採努足樵汝好穀寒屓馴春為外焔犯今娠胞膝狗格狽郭蝿監恋妥潮霜叩宜鞭暴蠅設闘環躊運望轄枷享棋栄炮撰顔劇永喋譜虱井富多関淫沙膣齟煮脚慄怠存國箭膂鼾部生原況蟻癪炸元史焉地卍嫡番慕墨雨襖些粋璧稽炭竣肪試情滲周堂佐蟲菌淵貫呂喪駒挙爆澹刷穂衛饗夫錐寇説判壺祖践毟陳駕疲園託凌勝純観将締吟督寓節諫諒鎧胆索蔵貯学瑾尤赦碧眉糸弦播脛首罠尻顎梟檄隈楽暑櫛境蓋回効拠榊凝屠捌釘答豊賜諧恬第疎驚湛子冷宮侈倉密坊凪囚龍掬拵穿烈降話済芻漫搾詫湖曹異暮玉麿俺麩舵悪礫吶篭正濫馳力盛弭蚓輿坪庶巫刀肩爛弛椅個艶武麟蝗規幼婆洋芯此凱坤伝胡尋烏職竺溢乃洒頁愛廃斗鍋鼬礁腎衷下俎涯僕宰己楕輝勧浜延褐名出枚構籠禄般臭猾糧碇色路恰執慧藍要浮濛沫霊幣侘肌截雖冬払私紙値傾釈勿続舐増斐宝恚貌相疼賀傷陽四裁躁刺贈棚傭廊程照褄杓稀孺嶽絡笑幽慢蒸辞坑屁測遊息砦鞋嗄納裕髪疑砕幻吼襟肺籤逼菓虜粧雲謬件両演譫慣可杯濁器曖遂汎聞営輸廓添燥撼常揚詮都殖奉拗求普齢諮渡岸錫矜猴老面松朱桁革愉席槃倒枠淑咄賞幾酒離車火儲戚行杵涛鷲妊疾役廉域宅浸諺放憩攣及所終剋涅喘彷星哲辻風惑織縮泊亭漠雛咥耳裾竜皆守謡逸肉貸病係州独蒼騎瞋載場鱇斬罅咀野嬌陵精紘影損旭図斃獰鉤西太欺鯨似兜夢録牒験末充鎚庫示通診熱佇看客又険牲瞥籍僥妄奮硝萌跳儘感紀仄命仔縄態唱蟇砂襲塁週闢鐘美碌告脱蓙矯抉腱前寧邸忠慇韻跡贋脆迅権欒訂粒募余紆憧滾挺専沼年匹帰壷留卜痢髄拍滸威旧統杭我獲講曽啼複午寡宣哭了鋸停嶺泥田湧憤囀鏃鱈圭毀鈍握折寵粟填巴茂狙福酔趣杏噤雇壱啜憚文贖舌現冤捏介諸槐溌速傘祢紛道偵潤紫勁秤襞髀淋農妙土旋妨蘭屋軍悶漂而晶鮨厨辜識興詳厳汲摯天涜嗜達挫送婉孝躍卵霧渋餐僑視酊佩姿隊鬼令排僧醐狩晩媒蕎声標拮梵的郎芳擲界做坦妬嘗蜃骸駁債沽潔悟迦蹴鬚艱楠滋蜈腋逗液積颪憶拳詞癒抽稲葛舞丸登侭眈丘最碑用梁叭事閂微粛邏届認詛焦港辟眺呷姉雫追尚簒万込獅托淀叫哨船銅姪悍者舶同恍臨白澱証侯凄涎鑠蟠覗克打抜嚢筒重猶寺古帷会金鴉黙珈醸鶏初薄丼
掠殉芝素在也頑渾稜刑剔奥諜牧朴緩千眦穽函掟戦寿先券妹畏菊粍制雄画按搦擱飽嘴与兼邂族頬瞑焙財隕洗誑飼儚号亦坐館粗酩備披典陸蹄冊疫鰭鸚趨怜宇嗤敵蠣誘槌迫閣王窪厩混後帽捗娘資霰婿負距性腔授蛍鍬鉈巾炙囮彙貪窓陪殺論帆蔽覆身替絞揖案賓酸若梨駄賛坂弐氷小嬉械間児嬢隅忽爾胴粘鍔叡篤哮取刈段応湘党溺謳沖源委快憂皿杉漕鑑治適墜横捧祀股剰列額致橙衝化黎孫歯拭絶尾教怪越女合喚貞干擡駐鞘涼献受妖蹕覇軋臣岡鳶才岳拒拱蜻箸瀉恥暈拷暢圏采謗嵐巡畸算少窶掴氾昭惧硬因敏尺徳綺固繰董陛宛奸吻杜聴痴匿穢買上喩様慈点狐弊愧窄燃肚婦蛄貰睡喧謐姻唆批朧蝦凹衆捉形三訊棟閑順木漢梅橋俯擁辛酵痺惣牛口閉簿掛君勾幸林忖植溝鞍鏑反繭吐乳早蛤靡肯冥痒罵臼悠翁麓拌魁辣径円漲鎬圧省連煕章厭隙北悉縛細親障套労知玩彼式契肢杳禊嘩苧側互刎恕奈鼠滝裲脅狸櫓波鵠偏智鬱慟価畳悩支信訴絆温翼逮兵却聳咲朦懇是併遍否奴弄祥塵簪探粉費捲殊透燦箍蚯頂蝶葬禁儡擽如嘆誰濯経眩家頤噌煉痛拘睨黴佃既射繚苞便腕擬軟語死埃工秋殆湿伽劫仲箇樋継措量稼壁台砥炊向撮把錦協姐貴朕報麾概聊愴弾摘謝位膠崖瞠蕩甘窟転据矍安鎖仁怒咤菫九曳騰瑣庇詠弑峻撲限怯唯靂着踪争煌蛛居苔濤耐音貧亀汐官戸区獣朝畿河援立箱諍逢嘔甲栖占獪徹呉祭吊搬鈴杷攘旅型徊忌依頻玄遥緑襠嬪崩桜犬賽侮侶駅吹缶恭錆激草等諛室蹂起例遽更欧瞰青恒翔汽劔柴慶乖抑卸考秩蟹阿偉錠編走付宵蠕且吉倅孔漬該曾箴無思産店抓訳蛮貶肥嗟轡誕幡瑞服堰待垂宙惟噂施胃蟷患一隘駑彗殿灯狂猿護竟輌徨皺壮邁帥触謎成椿謹盲蚣扉矛牙鑰撻嚇歓培還伍梃士祓派睦淹鏡唄頓定洞揄拉靄辺句城誠人遅萎肝瞬儒絢賊壊公矮倍楼睥逡率忘拙曝市檎弔渚壌笹読肋薫聡宴推扇煤奔鉛煙誼瀕災驕筈掌罹撃真帝勅需領桶蛆孕狄遷製煽島憫業愁芋能蘇菩擢栞移低羽袖后餉炉裏言不棒泣凡刳弩懼汚艇贔刮舎座灰閧梢敗縁軽購儀咽懐兄架茶査孤類栓級窘漆忸只銀恢窩百窺潰蔑洟冗研廷韜褻院桑逃噺塊焚鍛愕賤衡搭笥鋭想線澳森揮痣癇倫胸泄傅酌総貿確苗減拶中俟江斎罪返旦滅毛劣爪浴鏖裔舳楯朗亜綱遵郷兎隔暗諾止螻易阪晒直呆志垣齧饌髷羆胎雉迂尽隻釉轍褒牝蚊共腑寞称持溜弟躯饅蓄皮糊品傀胤網顧之郊主耗姑薨滴遮餓叶組血遇但衣毯迷蠢蒐誂脈跪促責荒俳揃茫嬰描比叛再敦眇虔束撤阻峠煎弓粂膚筍渉斥交寄癌辱牢魅逞徐醍甚抵糠対罷控蠍糞歩写筆憮那厄方質升屈軛秀童悲耄割羨展著欠単俵鳩饐睛諚桧餌脇技郵堀夥詐改廻退傍愾芥栽僭頃津膾諦膿詰佳霹站杞餅香踵書腫歳櫂牽皇融明娑贅害陶掲召鉢養紅咎拝戴瘴男鼻穫豚司校鋏狷航縦茜恫嶋呪鮟礎贄畢習賢提幕蒲腥繕核汰料剛具覚散逐腸問逅綻表旗椋奏柄翅伏岐導断澤象卿浚堅卑紳讐峙邦腹槍本雌請曜訣庸蝠

=== 100 Most Frequent Kanji:

1. 景 — 49736.
2. 明 — 38469.
3. 絵 — 33367.
4. 正 — 22351.
5. 村 — 19586.
6. 常 — 16988.
7. 通 — 16564.
8. 一 — 14559.
9. 背 — 12651.
10. 枝 — 9714.
11. 茶 — 9548.
12. 丸 — 9469.
13. 奈 — 9325.
14. 香 — 9299.
15. 演 — 9257.
16. 条 — 9046.
17. 上 — 8471.
18. 名 — 8270.
19. 光 — 7633.
20. 戦 — 7404.
21. 人 — 5429.
22. 長 — 5418.
23. 間 — 5227.
24. 闘 — 5186.
25. 動 — 5154.
26. 心 — 4928.
27. 子 — 4926.
28. 色 — 4618.
29. 黒 — 4464.
30. 大 — 4459.
31. 何 — 4377.
32. 前 — 4126.
33. 定 — 4076.
34. 時 — 4024.
35. 立 — 3734.
36. 飛 — 3670.
37. 童 — 3657.
38. 撃 — 3647.
39. 空 — 3596.
40. 声 — 3580.
41. 事 — 3545.
42. 俺 — 3491.
43. 他 — 3453.
44. 体 — 3389.
45. 小 — 3273.
46. 方 — 3268.
47. 騎 — 3125.
48. 男 — 3083.
49. 見 — 3039.
50. 幕 — 3033.
51. 構 — 2997.
52. 車 — 2981.
53. 町 — 2978.
54. 下 — 2977.
55. 者 — 2968.
56. 出 — 2931.
57. 無 — 2921.
58. 度 — 2912.
59. 雄 — 2898.
60. 手 — 2837.
61. 自 — 2823.
62. 思 — 2796.
63. 選 — 2775.
64. 笑 — 2769.
65. 機 — 2766.
66. 回 — 2724.
67. 用 — 2721.
68. 合 — 2709.
69. 作 — 2700.
70. 雪 — 2699.
71. 言 — 2678.
72. 吼 — 2632.
73. 二 — 2631.
74. 獅 — 2626.
75. 雷 — 2599.
76. 署 — 2578.
77. 択 — 2571.
78. 白 — 2548.
79. 敵 — 2529.
80. 画 — 2505.
81. 太 — 2497.
82. 分 — 2478.
83. 指 — 2477.
84. 転 — 2446.
85. 蝶 — 2399.
86. 気 — 2396.
87. 兵 — 2379.
88. 行 — 2356.
89. 突 — 2336.
90. 中 — 2334.
91. 世 — 2332.
92. 力 — 2300.
93. 首 — 2297.
94. 窓 — 2257.
95. 義 — 2256.
96. 保 — 2191.
97. 肢 — 2178.
98. 銀 — 2176.
99. 殿 — 2168.
100. 領 — 2168.
Last modified on 2021-04-02 at 18:25
#2 by tomtheerogeman
2021-04-02 at 18:37
Have you thought about releasing your tool on github or something? That would make my hobby so much easier. Thanks for the list!
Last modified on 2021-04-02 at 18:37
#3 by tester
2021-04-02 at 18:44
Have you thought about releasing your tool on github or something? That would make my hobby so much easier
Well, I'm currently testing it, but at least my data matches (or is at least close enough to) the TLWiki data.
Some of my tools are currently awaiting a GitHub release (I don't have time to actually release them right now), so I'll release it after some time (maybe a month or so?).

And it's not like you can just "run a VN and get all the kanji data a second later". No, you actually need either to extract the scripts (and if they are compiled, you'd need to decompile them or at least extract the strings from them), or just to hook the strings.

In very long games like Muramasa you'd need to wait some time too, but that's the least of the problems.
Last modified on 2021-04-03 at 09:11
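Once the script text has been extracted, the counting step itself is simple. Below is a minimal sketch, not the actual tool: it assumes the extracted text is already a plain Python string, and it treats only the main CJK Unified Ideographs block and Extension A as kanji. The names `is_kanji` and `kanji_stats` and the sample text are made up for illustration. Where exactly to draw the line (for example, whether 々 counts) is a choice, so different tools may report slightly different totals.

```python
from collections import Counter


def is_kanji(ch):
    """Return True for characters in the main CJK block or Extension A (an assumed definition)."""
    code = ord(ch)
    return 0x4E00 <= code <= 0x9FFF or 0x3400 <= code <= 0x4DBF


def kanji_stats(text, top=100):
    """Return (number of unique kanji, list of (kanji, count) for the most frequent ones)."""
    counts = Counter(ch for ch in text if is_kanji(ch))
    return len(counts), counts.most_common(top)


# Hypothetical sample text standing in for an extracted script.
sample = "村正の村は村。景明は正しい。"
unique, ranking = kanji_stats(sample, top=3)

print(f"=== Unique Kanji ({unique}):")
for rank, (kanji, count) in enumerate(ranking, start=1):
    print(f"{rank}. {kanji} - {count}.")
```

For a real game the `sample` string would be replaced by the concatenated script files, read with whatever encoding the engine uses.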
#4 by janpri
2021-04-02 at 19:03
My eyes are tired just from looking at all of that... It's incredible someone can actually learn to read this and remember every single character.
Last modified on 2021-04-02 at 19:03
#5 by draconyan
2021-04-02 at 19:26
Reading this was definitely a challenge. I'm not sure anything can top that.

I guess it's expected the top kanji are mostly the ones found in the names of the main characters.
#6 by tester
2021-04-02 at 20:09
I'm not sure anything can top that.
You may want to try this or that. I did not read these titles myself, but, as I heard, the difficulty of these may top even Muramasa's (especially the latter's).
Last modified on 2021-04-02 at 20:12
#7 by draconyan
2021-04-02 at 20:43
You may want to try this or that. I did not read these titles myself, but, as I heard, the difficulty of these may top even Muramasa's (especially the latter's).

Thanks, I'll definitely keep them in mind. I did already have the second one in my wishlist for some reason.
#8 by shinytentacool
2021-04-02 at 20:56
Cool wall of squigglies, dude
#9 by mrkew
2021-04-02 at 21:13
The existing tool is link and the difficulty list says 3070 instead of 3085. Now who's wrong?
#10 by tester
2021-04-03 at 09:11
As for TLWiki, it says 3086 instead of 3070 or 3085 (see here).
#11 by kiru
2021-04-03 at 10:34
JWPce has a counting function. If you open the script with it, you could double-check.
It also lists the most common ones and such.
Last modified on 2021-04-03 at 10:35
#12 by vninfohata
2021-04-03 at 11:55
Math bad. I trust the jpdb.io more.
#13 by tomtheerogeman
2021-04-06 at 05:20
Thanks kiru, I didn't know about that program beforehand. Although it appears to be abandonware, it still works on Windows 10 and can be found on tanos for the time being.

I'm not sure if tester is still working on it as a result, but I have a request. Is it possible to make the same tool, but instead of scanning for kanjis, scan for all words instead, like Rikaichamp or any other mouseover dictionary? I thought the kanji thing was useful because it could help me find and study vocabulary I might encounter, but words are even better because not everything I don't understand involves obscure kanji. Like I see 虱 for example but it's luck that that character's even there in the first place, very often that's written in katakana just because it's hyougai kanji. Thus you wouldn't be able to study it unless the script just happens to use the kanji for it, which is kind of rare.

If you're wondering what my goal is, about 3 years ago a few people had all their text hookers stop working after the Windows 10 creator's update. I don't recall whether or not we had any devs working on text hookers at all at the time, or even how that issue was resolved in the first place. That's partly why I study so much vocabulary. But if that happens again, a workaround for the (possibly) growing numbers of JP readers I see here is to extract/decompile the script to .txt format, use said program to scan for all unique words, put them into a txt file, transfer them into Anki or something, and that way you can study all vocabulary in the novel before reading it.

There'll be many false ones due to the lack of spaces in Japanese, but for people like myself who probably know 80-90% of words in a game instead of 100% that would be awesome. Tester, I want to donate to you if you make this a reality.
Last modified on 2021-04-06 at 05:21
#14 by tester
2021-04-06 at 09:57
Hmmm... I'll think about this, but I don't have time to write a full-fledged JP parser from scratch. It would be quite a complicated and, above all, very time-consuming task (I'm talking about a good parser).

But it would be a completely different story if there were some parser API (as a dll, an exe or some Python class) or a library to implement parsing with. I know of some JP parsers, but I don't know their APIs yet. So I'll need to research them a bit.

But in exchange it will need much more time to run (as I expect). And depending on the parser there may be some English words from the scripts in the output.
Last modified on 2021-04-06 at 10:15
#15 by shiny
2021-04-06 at 16:03
If I may make a suggestion, I think sorting whether a kanji is included in the list of joyo kanji would be useful, since that way you could judge how many characters the game uses that are outside everyday usage.
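A rough sketch of that kind of sorting, assuming the kanji found in the game are already available as a string or list. The function name `split_by_list` is hypothetical, and `JOYO_SAMPLE` is only a five-character stand-in; a real run would load the full joyo list (2,136 characters) from a file instead.

```python
def split_by_list(kanji, reference):
    """Split kanji into (those on the reference list, those off it), keeping input order."""
    inside = [k for k in kanji if k in reference]
    outside = [k for k in kanji if k not in reference]
    return inside, outside


# Tiny stand-in for the joyo list, for illustration only.
JOYO_SAMPLE = set("村正景明水")

found_in_game = "村虱景蝙"  # hypothetical set of kanji found in a script
joyo, non_joyo = split_by_list(found_in_game, JOYO_SAMPLE)

print(f"joyo: {len(joyo)}, outside joyo: {len(non_joyo)}")
print("outside joyo:", "".join(non_joyo))
```

The same function works for any other reference list (JLPT levels, jinmeiyo and so on), so several lists can be applied one after another.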
#16 by tester
2021-04-06 at 21:00
Okay, I did some magic with the MeCab library in Python to make the same tool for words and... the result of its work is there. Do you see any problems with the output (not related to MeCab tokenizing, of course)? What grammar is pointless there and needs to be filtered out?
Last modified on 2021-04-06 at 21:44
#17 by tomtheerogeman
2021-04-07 at 06:07
Tokenizing, so that's what this is called...

Alright, I read all 835 words and the majority of the output is okay, and it would still be pretty useful as is to intermediate-advanced Japanese learners like myself, but here are the issues that you might want to address:

I don't know how much Japanese you know, but assuming you know little about the language, first see link , then click "show inflections" to be able to understand #1 below.

1. When an u-verb is conjugated in either the past form, te form, passive form, or causative form, various mouseover dictionary programs may not show you the verb because the inflection in the end is missing from your output. It's not just the ta and te endings you mentioned, in other words. Because of this, those verbs would need to be added manually into Anki rather than automatically. The majority of verbs in Japanese are u verbs and not ru verbs, so it would take more time for the user to create study material for themselves the more u verbs are in the script that the user isn't familiar with.

For further context, the browser add-ons nazeka and yomichan support automatic flashcard creation, which would allow the user to add the word, its pronunciation and definition into an Anki flashcard by just a click of a button (after converting words list.txt to words list.html and opening the file with the web browser, of course). This is not possible for many verbs when the inflection is no longer there. It's MeCab's fault, as you said.

2. Several repeats were detected. These are 動か, 動く, 動こ, and 引き摺る, 引き摺ら. As you can see, they were repeated because the program removes verb inflections in the cases I pointed out above.

3. When 2 or more words are together and written in katakana, the output detects them as all one word. These are ココニハイナイ, オマエハイナイ, ノモノニナッタラ, and ソンザイシナイモノ . It would normally be written as ここには居ない、お前は居ない、の物になったら、and 存在しない物. This is normally rare as no other writer I've seen would write words in katakana in a row like this.

4. ボリュ, ゃっ, っと, and モノダ are nonsense. No idea how the first 3 got in the output when they aren't words, and モノダ is likely 物だ or 者だ, a similar problem to number 3, but this time it's one word in katakana with a grammatical particle in katakana right next to it.

5. I don't think there's any pointless grammar that needs to be filtered out. There are super common words like から (because), という (called), しか (only), etc. I guess you could say it's pointless because you should be familiar with them before you consider using this program, but at the same time they are real words, just like everything else in the output.

Also, I thought about this at work, and there should be a blacklist that filters out unwanted words, the reason being that the program will unintentionally spoil the endings to some novels. For example, if I wanted to read E School Life and the output for that novel contained words like 自殺, 縊る, 病院, レイプ, 癌, etc. that wouldn't be good. It would be fine if it was a character playing a violent video game, a person killing a spider in their room or whatever, but the closer you get to the ending the more likely you are to suspect that something bad will happen in the end. I want to come up with a list of "bad words" to blacklist to significantly reduce the odds of spoiling the game, but it'll have to wait until this weekend because I already spent most of my evening writing this whole reply =)
Last modified on 2021-04-07 at 06:10
#18 by tester
2021-04-07 at 06:55
Blacklist is quite easy to create.
As for verb inflections and repeats... I did want to fix it from the start, but it would be problematic with the model I use. I'll think about how I could fix it.
Last modified on 2021-04-07 at 06:56
#19 by tester
2021-04-07 at 08:03
Okay, partially fixed. The new results are in the edited first post.
Long katakana rows are rare, and if I tried to fix them, it might cause problems for other katakana words, so I left them be.
All verbs are now changed to "dictionary form" (thanks to the -Ochasen mode) (well, at least the ones that MeCab handles correctly).
Katakana nonsense words are partially fixed (MeCab sometimes has problems with ー).
Last modified on 2021-04-07 at 09:12
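To show why the dictionary-form switch also removes the repeats, here is a small sketch that reads ChaSen-style output (tab-separated columns: surface, reading, base form, part of speech, conjugation type, conjugation form) and keeps one entry per base form. It is not the tool's real code; the function name `dictionary_forms` is made up, and the sample lines were written by hand in that layout rather than taken from an actual MeCab run.

```python
def dictionary_forms(chasen_output):
    """Collect unique base (dictionary) forms from ChaSen-style lines, in first-seen order."""
    seen = {}
    for line in chasen_output.splitlines():
        if not line or line == "EOS":  # EOS marks the end of a sentence
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # not a token line in the expected layout
        base = fields[2] or fields[0]  # fall back to the surface form if the base is empty
        seen.setdefault(base, None)
    return list(seen)


# Hand-written sample in the assumed column layout.
sample = "\n".join([
    "動か\tウゴカ\t動く\t動詞-自立\t五段・カ行イ音便\t未然形",
    "ない\tナイ\tない\t助動詞\t特殊・ナイ\t基本形",
    "動く\tウゴク\t動く\t動詞-自立\t五段・カ行イ音便\t基本形",
    "引き摺ら\tヒキズラ\t引き摺る\t動詞-自立\t五段・ラ行\t未然形",
    "EOS",
])

print(" ".join(dictionary_forms(sample)))
```

With this approach 動か and 動く collapse into a single 動く entry, and 引き摺ら becomes 引き摺る, which a mouseover dictionary can look up.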
#20 by tomtheerogeman
2021-04-08 at 05:21
Thank you for your work so far.

You don't have the original list anymore, do you? There used to be 835 words but now there's 732. I think your fix removed parts of verb/adjective inflections and grammar constructions falsely detected as words, as well as the repeats/nonsense I mentioned before, but I didn't expect over 100 false results in the original. If I can see the previous list I might be able to tell you what happened. At least the new output looks a lot neater.

Also, can you make sure that the blacklist can be edited by the user and contain thousands of words without overloading the program or something? I Googled and figured out that I might be able to export all the old vocabulary I studied from Anki, then add it to the blacklist so that if a game has thousands of words I can remove all the ones I already know and get a much shorter list to save time.

I'll still make a default list of "bad words" to avoid spoilers shortly.
#21 by tester
2021-04-08 at 07:27
Also, can you make sure that the blacklist can be edited by the user and contain thousands of words without overloading the program or something
I did this already. Well, it could increase the program's running time, but not critically.

As for the old list... Well, I only have the oldest variant of it. Here it is:

心 感 青い くれ 知り 返す 落ちる 会っ 青ざめ 捕まえ 相反 いっか 求め 底 週間 何となく みる 見つかっ 答え なにか 微動 同じ 寂しい 無音 さっき 近づい 相変わらず よぎる こんなにも 覗き 望み まず ここ 付着 違っ 映っ あおく 笑顔 上手く 夕方 引き摺る 陥る 俯い 話し 思い切っ 仲 じゃあ 反芻 あっ 履 行け 座る 冷たい それ 入れ 苦しく 無理 想い 絶対 獰猛 変わっ それら どう 長い 足元 観察 ドロ なぜ 逸らさ けど だからこそ 悲しみ 明るい しよう これから 丈 ながら 視線 いう 痛 だって 確固たる イメ 度 いっぱい 見知っ てる 後 程なく 入り口 家族 より 泣き声 理解 やっ きり ぐるぐる 髪 興味 気に入ら 次々 中 走り出し ツギニ 不思議 何だか 話せ 賑やか オン っと なのに 以上 苦笑い 上手い 出来 居 悲しい 縛っ たいした ごめん 見つける 簡単 いったい 部活 下 人間 手 耳 黒く 裏切る 気 違和感 言わ モノダ 見回し 隣 顔つき 高い そう 逃げ 通っ 忙しく 木々 いい こんにちは だっ ぐらい 表情 なし 上 言っ 思わ 期間 電柱 靴音 傾き 逸らす 背 エン テキスト 寂し 来れ 人 それでも ううん きっと クリック 箇所 頬 なんか いか 楽し 目 受け入れ 問い 探し 家路 頭上 気がつい 初めて いれ 夜 っていう よろしく 身 今回 よぎら 当然 隠れる 付い 見送る 再び 感じ 出 効果 最近 待た 場所 あの 暗い 既 モノ 濡れ 以外 閉じ込め あいつ 嫉妬 考える 殺し 赤み 瞳 ゃっ 容赦 下手 見上げる 子 起き上がろ 落ち着き くん たん 灰色 呼ぶ 僕 的 接触 かも 疑問 いっ 枝 知れ 毎日 跳ね 隠し 散乱 わけ 仲良し なけれ 急 せい 入り 繰り返さ 終了 おかしい 早く 決まり 怖い 横 なかっ しよ こと 昨日 普通 ねぇ なら 始 いつも だけ 速度 赤く また 鼓動 意味 続け 叫び声 綺麗 じゃ まるで 笑う 避け 変える ほら 考え ある 響い 止め なく 這お めん 戸惑い もう 不快 見て取れ やす 訝しげ 思わず 上げる 思える 遊べ 段々 しまえ 何 分から 言う 記 別 周り だろ 安心 ありがとう 今日 とか まま すると 不安 ない 吐き出さ 高揚 巡らせ 見る 横倒し ただいま 続き 公園 雰囲気 当たり前 ハズ 押さ 木下 変わる 逃げる 決め 閉ざす エンディング 今 飲み込ん 達 乗っ 生 差し伸べ なり 優し 集中 用事 壊し 車 持た 放っ 欲しい ちょうど 分かっ 動か 見 という 待っ 走り去る くる 聞こえる とき 液体 絶頂 頷い 生臭い でも 震える 開始 だめ 根っこ こんな ブランコ 振り 帰っ 立て 見つけ カラス 時々 苦し おく 意識 する 取ら 込む オマエハ かなり 悪い 戻る 続ける だから ソンザイシナイモノ 知り合い 見覚え いや アノ 探す 階段 探し出し 急ぎ 漂い そちら 再開 いく 数え ドン びっくり 待ち そびえ立つ 醜く 歴 壊れ 察し 何処 もっと 実は 怖く 時間 戸惑う 行か 呆然と トモ なる これ 確信 言い方 はやっ どんな 学校 鬼 とくに 渡す 攻撃 嬉し 全く よい 飽き 揺れ動き 恥ずかしい みよ 話す 紡ぎ 見え 寂しく 遊べる ひっ その 見届ける ボリュ 困惑 地面 おどけ 潜める 友好 よう 歪ん 突然 上がっ 徐々に なれる 笑っ 家 息 知ら 用 変 経っ 明日 立た 降り 戸惑っ 表示 込 なきゃ 引き摺ら 変わら 振っ 引っかかる 近道 途端 違う 暮れ 緒 次 やっぱり みたい あんた 騒ぐ すぐ どうして ぶり 必死 はっ 筈 棒 わくわく 行っ 感じる しれ 飽きる いたし 皆目 形成 数える ノモノニナッタラ 動こ そしたら 配慮 幸せ 見つかり 出し みれ マタ ぬらぬら 分 ちゃっ 日 前 決める あげる 来 言動 空 行こ あんなに 拾い 匂い 遊び ポイント 覚え 怪物 非 自負 いつの間にか 信じ なろ 左右 たくさん たく 恐怖 友達 何者 ため 赤黒く 聞か こそ ちょっと ダメ 向かっ 大丈夫 もの 何分 そそのかさ 伏せ 霧散 それとも もがく そんな 更に あだ名 見つめる 壊れる 持っ 顔 見つめ 嘘 少し 証拠 とこ 不気味 なん でき つか 書い どれ ソウイウ 録 ついと 間 見つから 止む オフ 伝えよ どちら 未来永劫 終わり 会い じっと 降っ まさか 言え 訳 姿 過ぎ 遠く 囲ま 帰る 足音 形 近所 なに 茂み 気味悪く 
始め そして 林 ヶ月 イヤ として 今度 たっ オマエハイナイ 消え 血 誰 名乗っ つけ 隠れよ がち って 考え込ん 獲物 隙間 らしい 現在 いる 内 光り 速い なっ 言葉 流し 言わさ ふと 矢継ぎ早 しか 腰掛け なんて 出る 気付く れる 鬱蒼 きょとん ゆっくり どこ あれ 心持ち 読 瞬 開い さえ 点 葉っぱ 塗ら 願っ 入っ 離れ 本当に いけ 判定 位置 数 うん でる わたし だらし 見渡す 流れ 疑う 宮田 直ぐ 通り 振り返る のに ほう 何事 未だ 話 感覚 たり 時折 静か 雨 色 狙う 上下 始まっ 見開き しまっ 渡さ 叶わ 口 たら 懸命 はぁ 暗闇 そろそろ 潰さ 頭 モトモト 紹介 さぁ よかっ ひょっこり 自己 夕焼け 心臓 話そ っけ 奥 覆わ 動く たい 迷っ 取得 首 気分 遅い やる気 ココニハイナイ 聞こえ ジョン 出来る だす 待つ 回り ただ 疲れ やっと でしょ 嬉しく ぽつり 分かる ぼく 隠れ スキップ 頃 暗く 真正面 飛び散る よし こちら まで 設定 走っ しかして だい 差す 闇 思え いよ たて きみ 振り向く さて 思う いつ 返事 体 気持ち 辺り 執着心 られ ずっと 音 まぁ 枯れ 倒れ 裏 似 ちゃんと 他 ちらりと 音量 足 決まっ 落ち かくれんぼ かけ 久しぶり 通る 歩く 今や 掴ん 笑い 名前 抑え 思い 愉快 いろいろ 処 やりとり あちら 気まずく 平気 部 木 話しかけ なぁ 左 始まり くれる 珍しい 赤い あっけなく 頷く 深い 想像 自分 見かけ 移動 声 孤独 必ず あんな 気付い まだ きれい 染め 遊ぼ 遊ん うし この 申し訳 喜び 振り向い かがん デモ ところ 思っ 優しかっ 走り 遠慮 見える 良かっ けっこう よく 影 埋め 見当 開け みせ じゃん 来る 青色 から ずつ 呼吸 ともLast modified on 2021-04-08 at 08:18
#22 by tomtheerogeman
2021-04-09 at 05:56
So I compared both and the fix you used that reduced it to 732 is perfect, please keep it that way.

If you were wondering what happened, most of the missing words were different inflections of the same verbs, there were more of those in the original than I thought. Others were nonsense, grunts, stammers, and other little things that don't really matter (things like じゃあ、はぁ、つか、etc). The only 2 removed words I noticed that slightly bother me are ドロ and じゃん, but out of 732 words it clearly looks like the best we could do.

I don't need a GUI to use the program, as I already have some experience with command prompt exes. (cd C:\エロゲ\testers's tool, then tester's tool.exe something ) You said it's a month or so before release?

Maybe tomorrow I will have the default blacklist of words ready, and that's probably the last of my work unless you need anything further.
#23 by tester
2021-04-09 at 06:58
I don't need a GUI to use the program, as I already have some experience with command prompt exes. (cd C:\エロゲ\testers's tool, then tester's tool.exe something ) You said it's a month or so before release?

Well, without GUI (I hate creating GUI) I could release it in a few days.
I'll post here after releasing it.

===
But I'd like to do some magic with kanji sorting (by lists) first to improve the tool's functionality even more.
Last modified on 2021-04-09 at 06:59
#24 by tester
2021-04-09 at 08:08
Okay, this is a part of the output of the new version of the Kanji Data module.
Last modified on 2021-04-09 at 09:55
#25 by tomtheerogeman
2021-04-10 at 04:45
Now for the blacklist:

All of these words (they are verbs btw): 命を絶つ, 死を選ぶ, 死に追いやる, 轢く, 縊る, 息の根を止める, 討ち果す, 仕留める, 轢き殺す, 亡き者にする, 血塗る, 釁る, 盛り殺す, 取り殺す, 焼き殺す, 切り殺す, ぶち殺す, 敵を倒す, 命を奪う, 切り外す, 敵を討つ, 仇を討つ

Also these words, and words that contain these: 自殺, 心中, 自害, 自尽, 死, スーサイド , 自裁, 情死, 自刃, 血祭り, 殺害, 虐殺, 自剄, 自刎, 自絞死, 生害, 癌, キャンサー, 肺がん, 診断, 膣がん, 膵臓がん, 胃がん, 乳がん, 子宮がん, 大腸がん, 食道がん, 膀胱がん, 病院, レイプ, 姦, 追突, 自動車事故, 浮気, 結婚, 指輪, 妊婦, 妊娠, 孕, 症, 障害, 119, 119, 110, 110

Not a complete list of every bad thing or life changing event, but I think it's enough to hide any surprises.
Last modified on 2021-04-10 at 04:48
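The two groups above map onto two kinds of matching: whole-word matches for the first group, and substring matches for the second ("words that contain these"). A minimal sketch of such a filter follows, assuming the word list is already tokenized. `load_blacklist`, `filter_words` and the one-word-per-line file format are assumptions for illustration, not the tool's real interface.

```python
def load_blacklist(path):
    """Read one blacklisted word per line from a UTF-8 text file (assumed file format)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def filter_words(words, exact, contains=()):
    """Drop words that are in `exact` or that contain any string from `contains`."""
    kept = []
    for word in words:
        if word in exact:  # set lookup stays cheap even with thousands of entries
            continue
        if any(part in word for part in contains):
            continue
        kept.append(word)
    return kept


# Hypothetical tokenizer output and a small excerpt of the lists from the post.
words = ["学校", "縊る", "病院", "死体", "公園", "轢く"]
exact = {"縊る", "轢く"}
contains = ["死", "病院"]

print(" ".join(filter_words(words, exact, contains)))
```

Keeping the exact-match words in a set means a large "already studied" export from Anki can be added to the blacklist without slowing things down much; only the substring list is checked entry by entry.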