This model is an improved version of [TrainKoSpacing](https://github.com/haven-jeon/TrainKoSpacing "TrainKoSpacing"), using FastText instead of Word2Vec.
This project was carried out as an industry-academia collaboration with Mirinae Inc. ((주) 미리내). Because a separate in-house version control service was used, only the source code and demo code are included here, without the commit history.
## Introduction
This model is an improved version of [TrainKoSpacing](https://github.com/haven-jeon/TrainKoSpacing "TrainKoSpacing"), using FastText instead of Word2Vec.
For more details, see our [presentation video](https://drive.google.com/file/d/1f-D3DC8cnrRniLvoAJ4WyreCjoo316Yc/view?usp=sharing "KCC Presentation Video").
## Performances
| Model | Test Accuracy (%) | Encoding Time Cost |
...
...
#### Corpus
We mainly use the National Institute of Korean Language 모두의 말뭉치 corpus and National Information Society Agency AI-Hub data. However, due to license issues, we are not allowed to redistribute these datasets. You can obtain them through the links below:
[National Institute of Korean Language 모두의 말뭉치](https://corpus.korean.go.kr/)
[National Information Society Agency AI-Hub](https://aihub.or.kr/aihub-data/natural-language/about "National Information Society Agency AI-Hub")
#### Data format
...
...
Bzipped file consisting of one sentence per line.
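As a hedged illustration of this format (the file name below is only a placeholder), such a corpus file can be read sentence by sentence with Python's built-in `bz2` module:

```python
# Minimal sketch of reading the training corpus, assuming the format above:
# a bzip2-compressed text file with one sentence per line.
# "corpus.txt.bz2" is an example name, not a file shipped with this repo.
import bz2

with bz2.open("corpus.txt.bz2", mode="rt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

print(sentences[:3])  # first few sentences of the corpus
```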
### Word Embedding
#### Jamo decomposition (자모분해)
To capture Korean characters with similar shapes, the FastText word embedding is trained on jamo-decomposed (자모분해) text.
ex)
자연어처리
ㅈ ㅏ – ㅇ ㅕ ㄴ ㅇ ㅓ – ㅊ ㅓ – ㄹ ㅣ –
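Below is a minimal sketch of this decomposition in Python. It is illustrative only, not the project's exact preprocessing code; the filler symbol `–` for an empty final consonant follows the example above.

```python
# Minimal jamo-decomposition sketch (illustrative only, not the project's exact code).
# Each composed Hangul syllable is split into initial/medial/final jamo;
# "–" marks an empty final consonant so every syllable expands to exactly three symbols.

CHO = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")                       # 19 initial consonants
JUNG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")                   # 21 medial vowels
JONG = ["–"] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 27 final consonants + filler

def decompose(text: str) -> str:
    jamos = []
    for ch in text:
        idx = ord(ch) - 0xAC00
        if 0 <= idx < 11172:                      # composed Hangul syllable block
            jamos.append(CHO[idx // 588])
            jamos.append(JUNG[(idx % 588) // 28])
            jamos.append(JONG[idx % 28])
        else:                                     # keep spaces / non-Hangul characters as-is
            jamos.append(ch)
    return " ".join(jamos)

print(decompose("자연어처리"))
# ㅈ ㅏ – ㅇ ㅕ ㄴ ㅇ ㅓ – ㅊ ㅓ – ㄹ ㅣ –
```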
#### 2-stage FastText
...
...
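The 2-stage details are abbreviated in this excerpt. As a rough, hedged sketch only, the snippet below shows generic FastText training on jamo-decomposed syllable tokens using gensim; gensim itself, the parameter values, and the per-syllable tokenization are assumptions for illustration, not the project's actual 2-stage procedure.

```python
# Hedged, generic sketch: training a FastText embedding on jamo-decomposed
# sentences with gensim. Library choice and parameters are assumptions;
# this is NOT the project's exact 2-stage setup.
from gensim.models import FastText

# Each sentence is a list of jamo-decomposed syllable tokens, e.g. produced
# by the decompose() sketch above and regrouped per syllable.
sentences = [
    ["ㅈㅏ–", "ㅇㅕㄴ", "ㅇㅓ–", "ㅊㅓ–", "ㄹㅣ–"],   # 자연어처리
]

model = FastText(
    sentences=sentences,
    vector_size=100,   # embedding dimension (assumed)
    window=5,
    min_count=1,
    min_n=1,           # character n-gram range over the jamo strings (assumed)
    max_n=3,
)

vector = model.wv["ㅈㅏ–"]   # look up the embedding of one decomposed syllable
print(vector.shape)          # (100,)
```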