README.md



Ko-Sentence-BERT

https://github.com/BM-K/KoSentenceBERT 

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (EMNLP 2019) 논문에서 공개한 코드, kakaobrain 팀이 공개한 KorNLUDatasets 과 ETRI KorBERT를 통해 Korea Sentence BERT를 학습하였습니다.


Installation

ETRI KorBERT는 transformers 2.4.1 ~ 2.8.0에서만 동작하고 Sentence-BERT는 3.1.0 버전 이상에서 동작하여 라이브러리를 수정하였습니다. 

huggingface transformer, sentence transformers, tokenizers 라이브러리 코드를 직접 수정하므로 가상환경 사용을 권장합니다.  

사용한 Docker image는 Docker Hub에 첨부합니다. https://hub.docker.com/r/klbm126/kosbert_image/tags 

ETRI KoBERT를 사용하여 학습하였고 본 레파지토리에선 ETRI KoBERT를 제공하지 않습니다.

git clone https://github.com/BM-K/KoSentenceBERT.git
python -m venv .KoSBERT
. .KoSBERT/bin/activate
pip install -r requirements.txt


transformer, tokenizers, sentence_transformers 디렉토리를 .KoSBERT/lib/python3.7/site-packages/ 로 이동합니다. 

ETRI_KoBERT 모델과 tokenizer가 KoSentenceBERT 디렉토리 안에 존재하여야 합니다.

ETRI 모델과 tokenizer는 다음 예시와 같이 불러옵니다 :

from ETRI_tok.tokenization_etri_eojeol import BertTokenizer
self.auto_model = BertModel.from_pretrained('./ETRI_KoBERT/003_bert_eojeol_pytorch') 
self.tokenizer = BertTokenizer.from_pretrained('./ETRI_KoBERT/003_bert_eojeol_pytorch/vocab.txt', do_lower_case=False)


Train Models

모델 학습을 원하시면 KoSentenceBERT 디렉토리 안에 KorNLUDatasets이 존재하여야 합니다. 

저는 STS를 학습할 때에 모델 구조에 맞게 STS데이터를 수정하여 사용하였고 데이터와 학습 방법은 아래와 같습니다 : 


KoSentenceBERT/KorNLUDatates/KorSTS/tune_test.tsv 


STS test 데이터셋의 일부 


python training_nli.py      # NLI 데이터로만 학습
python training_sts.py      # STS 데이터로만 학습
python con_training_sts.py  # NLI 데이터로 학습 후 STS 데이터로 Fine-Tuning


Pre-Trained Models

pooling mode는 MEAN-strategy를 사용하였으며, 학습시 모델은 output 디렉토리에 저장 됩니다. 


Performance

Seed 고정, Dev set


향상된 모델 성능은 다음 레파지토리에 있습니다. 
 https://github.com/BM-K/KoSentenceBERT_V2


Application Examples

생성 된 문장 임베딩을 다운 스트림 애플리케이션에 사용할 수 있는 방법에 대한 몇 가지 예를 제시합니다.

 제일 높은 성능을 내는 STS + NLI pretrained 모델을 통해 진행합니다.


Semantic Search

SemanticSearch.py는 주어진 문장과 유사한 문장을 찾는 작업입니다.

먼저 Corpus의 모든 문장에 대한 임베딩을 생성합니다.

from sentence_transformers import SentenceTransformer, util
import numpy as np

model_path = './output/training_nli_sts_ETRI_KoBERT-003_bert_eojeol'

embedder = SentenceTransformer(model_path)

# Corpus with example sentences
corpus = ['한 남자가 음식을 먹는다.',
          '한 남자가 빵 한 조각을 먹는다.',
          '그 여자가 아이를 돌본다.',
          '한 남자가 말을 탄다.',
          '한 여자가 바이올린을 연주한다.',
          '두 남자가 수레를 숲 속으로 밀었다.',
          '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.',
          '원숭이 한 마리가 드럼을 연주한다.',
          '치타 한 마리가 먹이 뒤에서 달리고 있다.']

corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = ['한 남자가 파스타를 먹는다.',
           '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.',
           '치타가 들판을 가로 질러 먹이를 쫓는다.']

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = 5
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    cos_scores = cos_scores.cpu()

    #We use np.argpartition, to only partially sort the top_k results
    top_results = np.argpartition(-cos_scores, range(top_k))[0:top_k]

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for idx in top_results[0:top_k]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (cos_scores[idx]))


 결과는 다음과 같습니다 :

========================


Query: 한 남자가 파스타를 먹는다.

Top 5 most similar sentences in corpus:
한 남자가 음식을 먹는다. (Score: 0.7557)
한 남자가 빵 한 조각을 먹는다. (Score: 0.6464)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.2565)
한 남자가 말을 탄다. (Score: 0.2333)
두 남자가 수레를 숲 속으로 밀었다. (Score: 0.1792)


========================


Query: 고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.

Top 5 most similar sentences in corpus:
원숭이 한 마리가 드럼을 연주한다. (Score: 0.6732)
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.3401)
두 남자가 수레를 숲 속으로 밀었다. (Score: 0.1037)
한 남자가 음식을 먹는다. (Score: 0.0617)
그 여자가 아이를 돌본다. (Score: 0.0466)


=======================


Query: 치타가 들판을 가로 질러 먹이를 쫓는다.

Top 5 most similar sentences in corpus:
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.7164)
두 남자가 수레를 숲 속으로 밀었다. (Score: 0.3216)
원숭이 한 마리가 드럼을 연주한다. (Score: 0.2071)
한 남자가 빵 한 조각을 먹는다. (Score: 0.1089)
한 남자가 음식을 먹는다. (Score: 0.0724)


Clustering

Clustering.py는 문장 임베딩 유사성을 기반으로 유사한 문장을 클러스터링하는 예를 보여줍니다. 

이전과 마찬가지로 먼저 각 문장에 대한 임베딩을 계산합니다. 


from sentence_transformers import SentenceTransformer, util
import numpy as np

model_path = './output/training_nli_sts_ETRI_KoBERT-003_bert_eojeol'

embedder = SentenceTransformer(model_path)

# Corpus with example sentences
corpus = ['한 남자가 음식을 먹는다.',
          '한 남자가 빵 한 조각을 먹는다.',
          '그 여자가 아이를 돌본다.',
          '한 남자가 말을 탄다.',
          '한 여자가 바이올린을 연주한다.',
          '두 남자가 수레를 숲 속으로 밀었다.',
          '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.',
          '원숭이 한 마리가 드럼을 연주한다.',
          '치타 한 마리가 먹이 뒤에서 달리고 있다.',
          '한 남자가 파스타를 먹는다.',
          '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.',
          '치타가 들판을 가로 질러 먹이를 쫓는다.']

corpus_embeddings = embedder.encode(corpus)

# Then, we perform k-means clustering using sklearn:
from sklearn.cluster import KMeans

num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_

clustered_sentences = [[] for i in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(corpus[sentence_id])

for i, cluster in enumerate(clustered_sentences):
    print("Cluster ", i+1)
    print(cluster)
    print("")


결과는 다음과 같습니다 :

 Cluster  1
['두 남자가 수레를 숲 속으로 밀었다.', '치타 한 마리가 먹이 뒤에서 달리고 있다.', '치타가 들판을 가로 질러 먹이를 쫓는다.']

Cluster  2
['한 남자가 말을 탄다.', '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.']

Cluster  3
['한 남자가 음식을 먹는다.', '한 남자가 빵 한 조각을 먹는다.', '한 남자가 파스타를 먹는다.']

Cluster  4
['그 여자가 아이를 돌본다.', '한 여자가 바이올린을 연주한다.']

Cluster  5
['원숭이 한 마리가 드럼을 연주한다.', '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.']


Downstream Tasks Demo

 
데모 영상으로 제작된 버전은 다음 레파지토리에 있습니다. 
 https://github.com/BM-K/KoSentenceBERT_V2


Citing


KorNLU Datasets

@article{ham2020kornli,
  title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
  author={Ham, Jiyeon and Choe, Yo Joong and Park, Kyubyong and Choi, Ilji and Soh, Hyungjoon},
  journal={arXiv preprint arXiv:2004.03289},
  year={2020}
}


Sentence Transformers: Multilingual Sentence Embeddings using BERT / RoBERTa / XLM-RoBERTa & Co. with PyTorch

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "http://arxiv.org/abs/1908.10084",
}

@article{reimers-2020-multilingual-sentence-bert,
    title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
    author = "Reimers, Nils and Gurevych, Iryna",
    journal= "arXiv preprint arXiv:2004.09813",
    month = "04",
    year = "2020",
    url = "http://arxiv.org/abs/2004.09813",
}