graykode

(docs) how to train

## How I trained the model
We used the pre-trained weights provided by CodeBERT (Feng et al., 2020) as the initial weights.

#### Added model
The added model can be trained with [CodeBERT's official repository](https://github.com/microsoft/CodeBERT). The cleaned CodeSearchNet corpus was used as training data; see [this document](https://github.com/microsoft/CodeBERT#fine-tune-1) for details. Training took about 23 hours with a batch size of 256.

```shell script
cd code2nl

lang=python #programming language
lr=5e-5
batch_size=64
beam_size=10
source_length=256
target_length=128
data_dir=../data/code2nl/CodeSearchNet
output_dir=model/$lang
train_file=$data_dir/$lang/train.jsonl
dev_file=$data_dir/$lang/valid.jsonl
eval_steps=1000 #400 for ruby, 600 for javascript, 1000 for others
train_steps=50000 #20000 for ruby, 30000 for javascript, 50000 for others
pretrained_model=microsoft/codebert-base #Roberta: roberta-base

python run.py --do_train --do_eval --model_type roberta \
    --model_name_or_path $pretrained_model \
    --train_filename $train_file \
    --dev_filename $dev_file \
    --output_dir $output_dir \
    --max_source_length $source_length \
    --max_target_length $target_length \
    --beam_size $beam_size \
    --train_batch_size $batch_size \
    --eval_batch_size $batch_size \
    --learning_rate $lr \
    --train_steps $train_steps \
    --eval_steps $eval_steps
```

#### Diff model
To train the Diff model, we have to use [our code](https://github.com/graykode/commit-autosuggestions/blob/master/train.py), since it implements the logic needed to distinguish added code from diffs.
For the training data, only the top 100 Python repositories listed in [this document](https://github.com/kaxap/arl/blob/master/README-Python.md) were cloned ([gitcloner.py](https://github.com/graykode/commit-autosuggestions/blob/master/gitcloner.py)), and the commit messages, added code, and deleted code were preprocessed into JSONL format ([gitparser.py](https://github.com/graykode/commit-autosuggestions/blob/master/gitparser.py)). The data we used is available on [Google Drive](https://drive.google.com/drive/folders/1_8lQmzTH95Nc-4MKd1RP3x4BVc8tBA6W?usp=sharing).
As with the added model, training took about 20 hours with a batch size of 256.
Note that the weights of the added model were used as the initial weights; be sure to set them with the `--load_model_path` argument.

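One possible way to fetch the prepared data from the Google Drive folder above before running the script below is the `gdown` tool (a minimal sketch; `gdown` is not part of this repository, and the folder ID is simply taken from the link above):

```shell script
# Assumption: gdown (https://github.com/wkentaro/gdown) is used to download the shared folder.
pip install gdown

# Downloads the prepared train/valid/test jsonl files shared above into the current directory.
gdown --folder https://drive.google.com/drive/folders/1_8lQmzTH95Nc-4MKd1RP3x4BVc8tBA6W
```
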
```shell script
lr=5e-5
batch_size=64
beam_size=10
source_length=256
target_length=128
output_dir=model/python
train_file=train.jsonl
dev_file=valid.jsonl

eval_steps=1000
train_steps=50000
saved_model=pytorch_model.bin # this is the added model weight

python train.py --do_train --do_eval --model_type roberta \
    --model_name_or_path microsoft/codebert-base \
    --load_model_path $saved_model \
    --train_filename $train_file \
    --dev_filename $dev_file \
    --output_dir $output_dir \
    --max_source_length $source_length \
    --max_target_length $target_length \
    --beam_size $beam_size \
    --train_batch_size $batch_size \
    --eval_batch_size $batch_size \
    --learning_rate $lr \
    --train_steps $train_steps \
    --eval_steps $eval_steps
```

## How to train for your lint style?
See the [Diff model](#diff-model) section above for the role each of these scripts plays.

#### 1. Cloning repositories from GitHub
This script clones all repositories listed in [repositories.txt](https://github.com/graykode/commit-autosuggestions/blob/master/repositories.txt).
```shell script
usage: gitcloner.py [-h] --repositories REPOSITORIES --repos_dir REPOS_DIR [--num_worker_threads NUM_WORKER_THREADS]

optional arguments:
  -h, --help            show this help message and exit
  --repositories REPOSITORIES
                        repositories file path.
  --repos_dir REPOS_DIR
                        directory that all repositories will be downloaded.
  --num_worker_threads NUM_WORKER_THREADS
                        number of threads in a worker
```

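For example, a typical invocation might look like the following (a sketch only; the `repos` directory name and the thread count are illustrative choices, not values prescribed by the repository):

```shell script
# Hypothetical example: clone every repository listed in repositories.txt
# into ./repos using 8 worker threads.
python gitcloner.py \
    --repositories repositories.txt \
    --repos_dir repos \
    --num_worker_threads 8
```
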
#### 2. Parsing added code, deleted code, and commit messages from cloned repositories
This script preprocesses the cloned repositories and splits the result into train, valid, and test data.

```shell script
usage: gitparser.py [-h] --repositories REPOSITORIES --repos_dir REPOS_DIR --output_dir OUTPUT_DIR [--tokenizer_name TOKENIZER_NAME] [--num_workers NUM_WORKERS]
                    [--max_source_length MAX_SOURCE_LENGTH] [--max_target_length MAX_TARGET_LENGTH]

optional arguments:
  -h, --help            show this help message and exit
  --repositories REPOSITORIES
                        repositories file path.
  --repos_dir REPOS_DIR
                        directory that all repositories had been downloaded.
  --output_dir OUTPUT_DIR
                        The output directory where the preprocessed data will be written.
  --tokenizer_name TOKENIZER_NAME
                        The name of tokenizer
  --num_workers NUM_WORKERS
                        number of process
  --max_source_length MAX_SOURCE_LENGTH
                        The maximum total source sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.
  --max_target_length MAX_TARGET_LENGTH
                        The maximum total target sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.
```

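As an illustration, the preprocessing step could be run as below (the `repos` and `data` directories and the worker count are placeholders; the tokenizer name matches the CodeBERT checkpoint used elsewhere in this document):

```shell script
# Hypothetical example: parse the cloned repositories in ./repos and write
# train/valid/test jsonl files to ./data.
python gitparser.py \
    --repositories repositories.txt \
    --repos_dir repos \
    --output_dir data \
    --tokenizer_name microsoft/codebert-base \
    --num_workers 8 \
    --max_source_length 256 \
    --max_target_length 128
```
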
#### 3. Training the Added model (optional for the Python language)
The Added model has already been trained for Python, so if you only want to build a Diff model for Python, step 3 can be skipped. For the other languages (JavaScript, Go, Ruby, PHP, and Java), however, [Code2NL training](https://github.com/microsoft/CodeBERT#fine-tune-1) is required to produce the initial weights for the model trained in step 4; a sketch for another language follows below.

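For instance, the Code2NL script from the Added model section can be reused with a different `lang` value; the per-language step counts below follow the comments in that script (JavaScript is shown here only as an example):

```shell script
# Hypothetical example: Code2NL fine-tuning for JavaScript instead of Python.
cd code2nl

lang=javascript
eval_steps=600    # 400 for ruby, 600 for javascript, 1000 for others
train_steps=30000 # 20000 for ruby, 30000 for javascript, 50000 for others

python run.py --do_train --do_eval --model_type roberta \
    --model_name_or_path microsoft/codebert-base \
    --train_filename ../data/code2nl/CodeSearchNet/$lang/train.jsonl \
    --dev_filename ../data/code2nl/CodeSearchNet/$lang/valid.jsonl \
    --output_dir model/$lang \
    --max_source_length 256 --max_target_length 128 \
    --beam_size 10 --train_batch_size 64 --eval_batch_size 64 \
    --learning_rate 5e-5 --train_steps $train_steps --eval_steps $eval_steps
```
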
#### 4. Training the Diff model
Train the Diff model for each language, using the corresponding Added model's weights as the initial weights.

```shell script
usage: train.py [-h] --model_type MODEL_TYPE --model_name_or_path MODEL_NAME_OR_PATH --output_dir OUTPUT_DIR [--load_model_path LOAD_MODEL_PATH]
                [--train_filename TRAIN_FILENAME] [--dev_filename DEV_FILENAME] [--test_filename TEST_FILENAME] [--config_name CONFIG_NAME] [--tokenizer_name TOKENIZER_NAME]
                [--max_source_length MAX_SOURCE_LENGTH] [--max_target_length MAX_TARGET_LENGTH] [--do_train] [--do_eval] [--do_test] [--do_lower_case] [--no_cuda]
                [--train_batch_size TRAIN_BATCH_SIZE] [--eval_batch_size EVAL_BATCH_SIZE] [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS]
                [--learning_rate LEARNING_RATE] [--beam_size BEAM_SIZE] [--weight_decay WEIGHT_DECAY] [--adam_epsilon ADAM_EPSILON] [--max_grad_norm MAX_GRAD_NORM]
                [--num_train_epochs NUM_TRAIN_EPOCHS] [--max_steps MAX_STEPS] [--eval_steps EVAL_STEPS] [--train_steps TRAIN_STEPS] [--warmup_steps WARMUP_STEPS]
                [--local_rank LOCAL_RANK] [--seed SEED]

optional arguments:
  -h, --help            show this help message and exit
  --model_type MODEL_TYPE
                        Model type: e.g. roberta
  --model_name_or_path MODEL_NAME_OR_PATH
                        Path to pre-trained model: e.g. roberta-base
  --output_dir OUTPUT_DIR
                        The output directory where the model predictions and checkpoints will be written.
  --load_model_path LOAD_MODEL_PATH
                        Path to trained model: Should contain the .bin files
  --train_filename TRAIN_FILENAME
                        The train filename. Should contain the .jsonl files for this task.
  --dev_filename DEV_FILENAME
                        The dev filename. Should contain the .jsonl files for this task.
  --test_filename TEST_FILENAME
                        The test filename. Should contain the .jsonl files for this task.
  --config_name CONFIG_NAME
                        Pretrained config name or path if not the same as model_name
  --tokenizer_name TOKENIZER_NAME
                        Pretrained tokenizer name or path if not the same as model_name
  --max_source_length MAX_SOURCE_LENGTH
                        The maximum total source sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.
  --max_target_length MAX_TARGET_LENGTH
                        The maximum total target sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.
  --do_train            Whether to run training.
  --do_eval             Whether to run eval on the dev set.
  --do_test             Whether to run eval on the test set.
  --do_lower_case       Set this flag if you are using an uncased model.
  --no_cuda             Avoid using CUDA when available
  --train_batch_size TRAIN_BATCH_SIZE
                        Batch size per GPU/CPU for training.
  --eval_batch_size EVAL_BATCH_SIZE
                        Batch size per GPU/CPU for evaluation.
  --gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS
                        Number of updates steps to accumulate before performing a backward/update pass.
  --learning_rate LEARNING_RATE
                        The initial learning rate for Adam.
  --beam_size BEAM_SIZE
                        beam size for beam search
  --weight_decay WEIGHT_DECAY
                        Weight decay if we apply some.
  --adam_epsilon ADAM_EPSILON
                        Epsilon for Adam optimizer.
  --max_grad_norm MAX_GRAD_NORM
                        Max gradient norm.
  --num_train_epochs NUM_TRAIN_EPOCHS
                        Total number of training epochs to perform.
  --max_steps MAX_STEPS
                        If > 0: set total number of training steps to perform. Override num_train_epochs.
  --eval_steps EVAL_STEPS
  --train_steps TRAIN_STEPS
  --warmup_steps WARMUP_STEPS
                        Linear warmup over warmup_steps.
  --local_rank LOCAL_RANK
                        For distributed training: local_rank
  --seed SEED           random seed for initialization
```
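
Putting the steps together, a Diff-model run for another language might look like the following sketch (all paths are placeholders: `model/javascript/pytorch_model.bin` stands for the Added model weights produced in step 3, and `data/` for the output directory chosen in step 2):

```shell script
# Hypothetical example: train a JavaScript Diff model, initialized from the
# Added model weights obtained in step 3, on the jsonl files produced in step 2.
python train.py --do_train --do_eval --model_type roberta \
    --model_name_or_path microsoft/codebert-base \
    --load_model_path model/javascript/pytorch_model.bin \
    --train_filename data/train.jsonl \
    --dev_filename data/valid.jsonl \
    --output_dir model/javascript-diff \
    --max_source_length 256 --max_target_length 128 \
    --beam_size 10 --train_batch_size 64 --eval_batch_size 64 \
    --learning_rate 5e-5 --train_steps 50000 --eval_steps 1000
```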