Showing 1 changed file with 178 additions and 0 deletions

docs/training.md (new file, mode 100644)
## How I trained the model
We used the pre-trained weights provided by CodeBERT (Feng et al., 2020) as the initial weights.

#### Added model
To train the Added model, use [CodeBERT's official repository](https://github.com/microsoft/CodeBERT). The cleaned CodeSearchNet dataset was used as training data; see [this document](https://github.com/microsoft/CodeBERT#fine-tune-1) for details. Training took about 23 hours with a batch size of 256.

```shell script
cd code2nl

lang=python #programming language
lr=5e-5
batch_size=64
beam_size=10
source_length=256
target_length=128
data_dir=../data/code2nl/CodeSearchNet
output_dir=model/$lang
train_file=$data_dir/$lang/train.jsonl
dev_file=$data_dir/$lang/valid.jsonl
eval_steps=1000 #400 for ruby, 600 for javascript, 1000 for others
train_steps=50000 #20000 for ruby, 30000 for javascript, 50000 for others
pretrained_model=microsoft/codebert-base #Roberta: roberta-base

python run.py --do_train --do_eval --model_type roberta \
    --model_name_or_path $pretrained_model \
    --train_filename $train_file \
    --dev_filename $dev_file \
    --output_dir $output_dir \
    --max_source_length $source_length \
    --max_target_length $target_length \
    --beam_size $beam_size \
    --train_batch_size $batch_size \
    --eval_batch_size $batch_size \
    --learning_rate $lr \
    --train_steps $train_steps \
    --eval_steps $eval_steps
```

#### Diff model
To train the Diff model, we have to use [our code](https://github.com/graykode/commit-autosuggestions/blob/master/train.py), which implements the handling needed to distinguish added code from deleted code.
As for the training data, only the top 100 Python repositories listed in [this document](https://github.com/kaxap/arl/blob/master/README-Python.md) were cloned ([gitcloner.py](https://github.com/graykode/commit-autosuggestions/blob/master/gitcloner.py)), and the commit messages, added code, and deleted code were preprocessed into jsonl format ([gitparser.py](https://github.com/graykode/commit-autosuggestions/blob/master/gitparser.py)). The data we used is available on [Google Drive](https://drive.google.com/drive/folders/1_8lQmzTH95Nc-4MKd1RP3x4BVc8tBA6W?usp=sharing).
Like the Added model, training took about 20 hours with a batch size of 256.
Note that the weights of the Added model were used as the initial weights. Be sure to set this with the `load_model_path` argument.

```shell script
lr=5e-5
batch_size=64
beam_size=10
source_length=256
target_length=128
output_dir=model/python
train_file=train.jsonl
dev_file=valid.jsonl

eval_steps=1000
train_steps=50000
saved_model=pytorch_model.bin # this is the Added model weight

python train.py --do_train --do_eval --model_type roberta \
    --model_name_or_path microsoft/codebert-base \
    --load_model_path $saved_model \
    --train_filename $train_file \
    --dev_filename $dev_file \
    --output_dir $output_dir \
    --max_source_length $source_length \
    --max_target_length $target_length \
    --beam_size $beam_size \
    --train_batch_size $batch_size \
    --eval_batch_size $batch_size \
    --learning_rate $lr \
    --train_steps $train_steps \
    --eval_steps $eval_steps
```

## How to train for your own lint style?
See the [Diff model](#diff-model) section above for the role of each script.

#### 1. Cloning repositories from GitHub
This code clones all repositories listed in [repositories.txt](https://github.com/graykode/commit-autosuggestions/blob/master/repositories.txt).
```shell script
usage: gitcloner.py [-h] --repositories REPOSITORIES --repos_dir REPOS_DIR [--num_worker_threads NUM_WORKER_THREADS]

optional arguments:
  -h, --help            show this help message and exit
  --repositories REPOSITORIES
                        repositories file path.
  --repos_dir REPOS_DIR
                        directory to which all repositories will be downloaded.
  --num_worker_threads NUM_WORKER_THREADS
                        number of threads in a worker
```
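
For example, a minimal invocation might look like the sketch below. The repository directory name and thread count are example values, not requirements:

```shell script
# Clone every repository listed in repositories.txt into ./repos
# (directory name and thread count are example values).
python gitcloner.py \
    --repositories repositories.txt \
    --repos_dir repos \
    --num_worker_threads 16
```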

#### 2. Parsing added code, deleted code, and commit messages from cloned repositories
This code preprocesses the cloned repositories and splits the result into train, valid, and test data.

```shell script
usage: gitparser.py [-h] --repositories REPOSITORIES --repos_dir REPOS_DIR --output_dir OUTPUT_DIR [--tokenizer_name TOKENIZER_NAME] [--num_workers NUM_WORKERS]
                    [--max_source_length MAX_SOURCE_LENGTH] [--max_target_length MAX_TARGET_LENGTH]

optional arguments:
  -h, --help            show this help message and exit
  --repositories REPOSITORIES
                        repositories file path.
  --repos_dir REPOS_DIR
                        directory to which all repositories have been downloaded.
  --output_dir OUTPUT_DIR
                        The output directory where the preprocessed data will be written.
  --tokenizer_name TOKENIZER_NAME
                        The name of the tokenizer
  --num_workers NUM_WORKERS
                        number of processes
  --max_source_length MAX_SOURCE_LENGTH
                        The maximum total source sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.
  --max_target_length MAX_TARGET_LENGTH
                        The maximum total target sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.
```
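
A concrete invocation might look like the following sketch; the directory names are example values, and `microsoft/codebert-base` is assumed as the tokenizer so that it matches the training scripts above:

```shell script
# Preprocess the cloned repositories into train/valid/test jsonl files.
# Directory names are example values; the tokenizer choice mirrors the training scripts above.
python gitparser.py \
    --repositories repositories.txt \
    --repos_dir repos \
    --output_dir data \
    --tokenizer_name microsoft/codebert-base \
    --num_workers 8 \
    --max_source_length 256 \
    --max_target_length 128
```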

#### 3. Training the Added model (optional for Python)
An Added model has already been trained for Python, so if you only want to build a Diff model for Python, this step can be skipped. For other languages (JavaScript, Go, Ruby, PHP, and Java), [Code2NL training](https://github.com/microsoft/CodeBERT#fine-tune-1) is required to produce the initial weights for the model trained in step 4.
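
Only a few variables in the Code2NL script from the Added model section need to change for another language. The sketch below uses JavaScript as an example; the step counts simply follow the comments in that script:

```shell script
# Example values only, following the comments in the Added model script above.
lang=javascript                 # or ruby, go, php, java
eval_steps=600                  # 400 for ruby, 600 for javascript, 1000 for others
train_steps=30000               # 20000 for ruby, 30000 for javascript, 50000 for others
output_dir=model/$lang
train_file=$data_dir/$lang/train.jsonl
dev_file=$data_dir/$lang/valid.jsonl
```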

#### 4. Training the Diff model
Train the Diff model for each language, using the weights of the corresponding Added model as the initial weights.

```shell script
usage: train.py [-h] --model_type MODEL_TYPE --model_name_or_path MODEL_NAME_OR_PATH --output_dir OUTPUT_DIR [--load_model_path LOAD_MODEL_PATH]
                [--train_filename TRAIN_FILENAME] [--dev_filename DEV_FILENAME] [--test_filename TEST_FILENAME] [--config_name CONFIG_NAME] [--tokenizer_name TOKENIZER_NAME]
                [--max_source_length MAX_SOURCE_LENGTH] [--max_target_length MAX_TARGET_LENGTH] [--do_train] [--do_eval] [--do_test] [--do_lower_case] [--no_cuda]
                [--train_batch_size TRAIN_BATCH_SIZE] [--eval_batch_size EVAL_BATCH_SIZE] [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS]
                [--learning_rate LEARNING_RATE] [--beam_size BEAM_SIZE] [--weight_decay WEIGHT_DECAY] [--adam_epsilon ADAM_EPSILON] [--max_grad_norm MAX_GRAD_NORM]
                [--num_train_epochs NUM_TRAIN_EPOCHS] [--max_steps MAX_STEPS] [--eval_steps EVAL_STEPS] [--train_steps TRAIN_STEPS] [--warmup_steps WARMUP_STEPS]
                [--local_rank LOCAL_RANK] [--seed SEED]

optional arguments:
  -h, --help            show this help message and exit
  --model_type MODEL_TYPE
                        Model type: e.g. roberta
  --model_name_or_path MODEL_NAME_OR_PATH
                        Path to pre-trained model: e.g. roberta-base
  --output_dir OUTPUT_DIR
                        The output directory where the model predictions and checkpoints will be written.
  --load_model_path LOAD_MODEL_PATH
                        Path to trained model: Should contain the .bin files
  --train_filename TRAIN_FILENAME
                        The train filename. Should contain the .jsonl files for this task.
  --dev_filename DEV_FILENAME
                        The dev filename. Should contain the .jsonl files for this task.
  --test_filename TEST_FILENAME
                        The test filename. Should contain the .jsonl files for this task.
  --config_name CONFIG_NAME
                        Pretrained config name or path if not the same as model_name
  --tokenizer_name TOKENIZER_NAME
                        Pretrained tokenizer name or path if not the same as model_name
  --max_source_length MAX_SOURCE_LENGTH
                        The maximum total source sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.
  --max_target_length MAX_TARGET_LENGTH
                        The maximum total target sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.
  --do_train            Whether to run training.
  --do_eval             Whether to run eval on the dev set.
  --do_test             Whether to run eval on the test set.
  --do_lower_case       Set this flag if you are using an uncased model.
  --no_cuda             Avoid using CUDA when available
  --train_batch_size TRAIN_BATCH_SIZE
                        Batch size per GPU/CPU for training.
  --eval_batch_size EVAL_BATCH_SIZE
                        Batch size per GPU/CPU for evaluation.
  --gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS
                        Number of updates steps to accumulate before performing a backward/update pass.
  --learning_rate LEARNING_RATE
                        The initial learning rate for Adam.
  --beam_size BEAM_SIZE
                        beam size for beam search
  --weight_decay WEIGHT_DECAY
                        Weight decay if we apply some.
  --adam_epsilon ADAM_EPSILON
                        Epsilon for Adam optimizer.
  --max_grad_norm MAX_GRAD_NORM
                        Max gradient norm.
  --num_train_epochs NUM_TRAIN_EPOCHS
                        Total number of training epochs to perform.
  --max_steps MAX_STEPS
                        If > 0: set total number of training steps to perform. Override num_train_epochs.
  --eval_steps EVAL_STEPS
  --train_steps TRAIN_STEPS
  --warmup_steps WARMUP_STEPS
                        Linear warmup over warmup_steps.
  --local_rank LOCAL_RANK
                        For distributed training: local_rank
  --seed SEED           random seed for initialization
```
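
After training, the same script can run evaluation on a held-out split via `--do_test`. The sketch below is an assumption about how that would look; the checkpoint path and test filename are placeholders for wherever your trained Diff model weights and test data actually live:

```shell script
# A sketch only: the checkpoint path and test file below are placeholders.
saved_model=model/python/pytorch_model.bin   # hypothetical path to the trained Diff model weights
test_file=test.jsonl                         # hypothetical test split produced by gitparser.py

python train.py --do_test --model_type roberta \
    --model_name_or_path microsoft/codebert-base \
    --load_model_path $saved_model \
    --test_filename $test_file \
    --output_dir model/python \
    --max_source_length 256 \
    --max_target_length 128 \
    --beam_size 10 \
    --eval_batch_size 64
```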