Showing 2 changed files with 31 additions and 2 deletions
1 | # ML-based Spacing Corrector | 1 | # ML-based Spacing Corrector |
2 | -This model is improved version of [TrainKoSpacing](https://github.com/haven-jeon/TrainKoSpacing "TrainKoSpacing"), using FastText instead of Word2Vec | 2 | +This project was an industry-academia collaboration carried out at (주) 미리내. Because it used a separate in-house version control service, only the source code and demo code are included here, without the commit history. |
3 | + | ||
4 | +## Introduction | ||
5 | +This model is an improved version of [TrainKoSpacing](https://github.com/haven-jeon/TrainKoSpacing "TrainKoSpacing") that uses FastText instead of Word2Vec. | ||
6 | + | ||
7 | +For more details, watch our [presentation video](https://drive.google.com/file/d/1f-D3DC8cnrRniLvoAJ4WyreCjoo316Yc/view?usp=sharing "KCC Presentation Video"). | ||
8 | + | ||
3 | 9 | ||
4 | ## Performances | 10 | ## Performances |
5 | | Model | Test Accuracy(%) | Encoding Time Cost | | 11 | | Model | Test Accuracy(%) | Encoding Time Cost | |
... | @@ -12,7 +18,9 @@ This model is improved version of [TrainKoSpacing](https://github.com/haven-jeon | ... | @@ -12,7 +18,9 @@ This model is improved version of [TrainKoSpacing](https://github.com/haven-jeon |
12 | #### Corpus | 18 | #### Corpus |
13 | 19 | ||
14 | We mainly focus on the National Institute of Korean Language 모두의 말뭉치 corpus and the National Information Society Agency AI-Hub data. However, due to licensing restrictions, we cannot redistribute these datasets. You can obtain them through the links below. | 20 | We mainly focus on the National Institute of Korean Language 모두의 말뭉치 corpus and the National Information Society Agency AI-Hub data. However, due to licensing restrictions, we cannot redistribute these datasets. You can obtain them through the links below. |
21 | + | ||
15 | [National Institute of Korean Language 모두의 말뭉치](https://corpus.korean.go.kr/). | 22 | [National Institute of Korean Language 모두의 말뭉치](https://corpus.korean.go.kr/). |
23 | + | ||
16 | [National Information Society Agency AI-Hub](https://aihub.or.kr/aihub-data/natural-language/about "National Information Society Agency AI-Hub") | 24 | [National Information Society Agency AI-Hub](https://aihub.or.kr/aihub-data/natural-language/about "National Information Society Agency AI-Hub") |
17 | 25 | ||
18 | #### Data format | 26 | #### Data format |
... | @@ -34,8 +42,11 @@ Bziped file consisting of one sentence per line. | ... | @@ -34,8 +42,11 @@ Bziped file consisting of one sentence per line. |
34 | ### Word Embedding | 42 | ### Word Embedding |
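35 | #### Jamo decomposition (자모분해) | 43 | #### Jamo decomposition (자모분해) |
36 | To capture the similar shapes of Korean characters, use FastText word embeddings with jamo decomposition (자모분해). | 44 | To capture the similar shapes of Korean characters, use FastText word embeddings with jamo decomposition (자모분해). |
45 | + | ||
37 | Example: | 46 | Example: |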
47 | + | ||
38 | 자연어처리 | 48 | 자연어처리 |
49 | + | ||
39 | ㅈ ㅏ – ㅇ ㅕ ㄴ ㅇ ㅓ – ㅊ ㅓ – ㄹ ㅣ – | 50 | ㅈ ㅏ – ㅇ ㅕ ㄴ ㅇ ㅓ – ㅊ ㅓ – ㄹ ㅣ – |
40 | 51 | ||
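The decomposition in the example above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual preprocessing code: the function name `decompose_jamo` and the constant names are hypothetical, and the en dash used for an empty final consonant is an assumption taken from the README example. Only the Unicode arithmetic for precomposed Hangul syllables (U+AC00 to U+D7A3) is standard.

```python
# Minimal sketch of jamo decomposition (자모분해) using Unicode arithmetic.
# All names here are hypothetical; they are not taken from the repository.

CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"    # 21 vowels
JONGSEONG = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ",
             "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ",
             "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]  # 27 finals + "none"

HANGUL_BASE = 0xAC00          # code point of the first syllable, "가"
SYLLABLE_COUNT = 19 * 21 * 28  # 11172 precomposed syllables
EMPTY_FINAL = "\u2013"         # assumption: en dash marks a missing final consonant


def decompose_jamo(text, empty_final=EMPTY_FINAL):
    """Split each Hangul syllable into (initial, vowel, final) jamo.

    Characters outside the precomposed Hangul block are passed through as-is.
    """
    jamo = []
    for ch in text:
        offset = ord(ch) - HANGUL_BASE
        if 0 <= offset < SYLLABLE_COUNT:
            initial, rest = divmod(offset, 21 * 28)
            medial, final = divmod(rest, 28)
            jamo.append(CHOSEONG[initial])
            jamo.append(JUNGSEONG[medial])
            # Emit a placeholder so every syllable yields exactly three tokens.
            jamo.append(JONGSEONG[final] or empty_final)
        else:
            jamo.append(ch)
    return jamo


print(" ".join(decompose_jamo("자연어처리")))
```

Emitting a placeholder for a missing final consonant keeps every syllable at a fixed width of three tokens, so syllable boundaries remain recoverable from the jamo sequence.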
41 | #### 2 stage FastText | 52 | #### 2 stage FastText |
... | @@ -47,6 +58,7 @@ Because middle part of output distribution are evenly distributed. | ... | @@ -47,6 +58,7 @@ Because middle part of output distribution are evenly distributed. |
47 | ![probability_distribution_of_output_vector](img/probability_distribution_of_output_vector.png) | 58 | ![probability_distribution_of_output_vector](img/probability_distribution_of_output_vector.png) |
48 | 59 | ||
49 | Use a log transform and the second derivative. | 60 | Use a log transform and the second derivative. |
61 | + | ||
50 | Result: | 62 | Result: |
51 | ![Thresholding_result](img/Thresholding_result.png) | 63 | ![Thresholding_result](img/Thresholding_result.png) |
52 | 64 | ||
... | @@ -110,7 +122,24 @@ Directory guide for embedding model files | ... | @@ -110,7 +122,24 @@ Directory guide for embedding model files |
110 | - **kospacing_wv.np** | 122 | - **kospacing_wv.np** |
111 | - **w2idx.dic** | 123 | - **w2idx.dic** |
112 | 124 | ||
113 | -### Reference | 125 | +## Demo |
126 | +![demo_img](img/demo_screenshot.png) | ||
127 | +You can watch the demo video [here](https://drive.google.com/file/d/1fYKapmplTmVKVxypj0-bB2TFV_2IiBO_/view?usp=sharing "here"). | ||
128 | + | ||
129 | +### How to Run Demo | ||
130 | +#### 1. Run the demo server | ||
131 | +```bash | ||
132 | +cd demo | ||
133 | +python server.py | ||
134 | +``` | ||
135 | +#### 2. Open the demo client page | ||
136 | +Open the HTML file at `demo/front-client/client_demo.html`. | ||
137 | + | ||
138 | +Enter a Korean sentence and click submit. | ||
139 | + | ||
140 | + | ||
141 | +## Reference | ||
114 | TrainKoSpacing: https://github.com/haven-jeon/TrainKoSpacing | 142 | TrainKoSpacing: https://github.com/haven-jeon/TrainKoSpacing |
143 | + | ||
115 | 딥 러닝을 이용한 자연어 처리 입문: https://wikidocs.net/book/2155 | 144 | 딥 러닝을 이용한 자연어 처리 입문: https://wikidocs.net/book/2155 |
116 | 145 | ... | ... |
img/demo_screenshot.png
0 → 100644
60.9 KB