## How I trained the model
We used the pre-trained weights provided by CodeBERT (Feng et al., 2020) as the initial weights.

#### Added model
The Added model can be trained with [CodeBERT's official repository](https://github.com/microsoft/CodeBERT). The cleaned CodeSearchNet dataset was used as training data; see [this document](https://github.com/microsoft/CodeBERT#fine-tune-1) for details. Training took about 23 hours with a batch size of 256.

```shell script
cd code2nl

lang=python #programming language
lr=5e-5
batch_size=64
beam_size=10
source_length=256
target_length=128
data_dir=../data/code2nl/CodeSearchNet
output_dir=model/$lang
train_file=$data_dir/$lang/train.jsonl
dev_file=$data_dir/$lang/valid.jsonl
eval_steps=1000 #400 for ruby, 600 for javascript, 1000 for others
train_steps=50000 #20000 for ruby, 30000 for javascript, 50000 for others
pretrained_model=microsoft/codebert-base #Roberta: roberta-base

python run.py --do_train --do_eval --model_type roberta \
    --model_name_or_path $pretrained_model \
    --train_filename $train_file \
    --dev_filename $dev_file \
    --output_dir $output_dir \
    --max_source_length $source_length \
    --max_target_length $target_length \
    --beam_size $beam_size \
    --train_batch_size $batch_size \
    --eval_batch_size $batch_size \
    --learning_rate $lr \
    --train_steps $train_steps \
    --eval_steps $eval_steps
```

#### Diff model
To train the Diff model, use [our code](https://github.com/graykode/commit-autosuggestions/blob/master/train.py); a separate implementation is needed here because the model has to handle both added and deleted code.
For training data, only the top 100 Python repositories listed in [this document](https://github.com/kaxap/arl/blob/master/README-Python.md) were cloned ([gitcloner.py](https://github.com/graykode/commit-autosuggestions/blob/master/gitcloner.py)), and the commit messages, added code, and deleted code were preprocessed into jsonl format ([gitparser.py](https://github.com/graykode/commit-autosuggestions/blob/master/gitparser.py)). The data we used is available on [Google Drive](https://drive.google.com/drive/folders/1_8lQmzTH95Nc-4MKd1RP3x4BVc8tBA6W?usp=sharing).
Like the Added model, training took about 20 hours with a batch size of 256.
Note that the weights of the Added model were used as the initial weights; be sure to pass them with the `load_model_path` argument.

```shell script
lr=5e-5
batch_size=64
beam_size=10
source_length=256
target_length=128
output_dir=model/python
train_file=train.jsonl
dev_file=valid.jsonl

eval_steps=1000
train_steps=50000
saved_model=pytorch_model.bin # this is the Added model weight

python train.py --do_train --do_eval --model_type roberta \
    --model_name_or_path microsoft/codebert-base \
    --load_model_path $saved_model \
    --train_filename $train_file \
    --dev_filename $dev_file \
    --output_dir $output_dir \
    --max_source_length $source_length \
    --max_target_length $target_length \
    --beam_size $beam_size \
    --train_batch_size $batch_size \
    --eval_batch_size $batch_size \
    --learning_rate $lr \
    --train_steps $train_steps \
    --eval_steps $eval_steps
```

## How to train for your lint style?
See the [Diff model](#diff-model) section above for the role of each script.

#### 1. Cloning repositories from GitHub
This script clones all repositories listed in [repositories.txt](https://github.com/graykode/commit-autosuggestions/blob/master/repositories.txt). An example invocation is shown after the usage output below.
```shell script
usage: gitcloner.py [-h] --repositories REPOSITORIES --repos_dir REPOS_DIR [--num_worker_threads NUM_WORKER_THREADS]

optional arguments:
  -h, --help            show this help message and exit
  --repositories REPOSITORIES
                        repositories file path.
  --repos_dir REPOS_DIR
                        directory that all repositories will be downloaded.
  --num_worker_threads NUM_WORKER_THREADS
                        number of threads in a worker
```

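For instance, an invocation might look like the following sketch. The `repos` directory and the thread count are placeholder choices; `repositories.txt` is the file in the repository root.

```shell script
# example invocation; --repos_dir and --num_worker_threads are placeholder choices
python gitcloner.py \
    --repositories repositories.txt \
    --repos_dir repos \
    --num_worker_threads 8
```
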
#### 2. Parsing added code, deleted code, and commit messages from the cloned repositories
This script preprocesses the cloned repositories and splits the result into train, valid, and test data. An example invocation is shown after the usage output below.

```shell script
usage: gitparser.py [-h] --repositories REPOSITORIES --repos_dir REPOS_DIR --output_dir OUTPUT_DIR [--tokenizer_name TOKENIZER_NAME] [--num_workers NUM_WORKERS]
                    [--max_source_length MAX_SOURCE_LENGTH] [--max_target_length MAX_TARGET_LENGTH]

optional arguments:
  -h, --help            show this help message and exit
  --repositories REPOSITORIES
                        repositories file path.
  --repos_dir REPOS_DIR
                        directory that all repositories had been downloaded.
  --output_dir OUTPUT_DIR
                        The output directory where the preprocessed data will be written.
  --tokenizer_name TOKENIZER_NAME
                        The name of tokenizer
  --num_workers NUM_WORKERS
                        number of process
  --max_source_length MAX_SOURCE_LENGTH
                        The maximum total source sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.
  --max_target_length MAX_TARGET_LENGTH
                        The maximum total target sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.
```

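Continuing the sketch from step 1, the cloned repositories can then be parsed into jsonl files. The directory names below are placeholders; the tokenizer and length settings simply mirror the training commands used elsewhere in this document.

```shell script
# example invocation; directory names are placeholders
python gitparser.py \
    --repositories repositories.txt \
    --repos_dir repos \
    --output_dir data/python \
    --tokenizer_name microsoft/codebert-base \
    --num_workers 8 \
    --max_source_length 256 \
    --max_target_length 128
```
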
#### 3. Training the Added model (optional for Python)
The Added model has already been trained for Python, so if you only want to build a Diff model for Python, step 3 can be skipped. For other languages (JavaScript, Go, Ruby, PHP, and Java), however, [Code2NL training](https://github.com/microsoft/CodeBERT#fine-tune-1) is required to produce the initial weights for the Diff model trained in step 4.

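As a sketch, only the language-specific variables of the Added model command above need to change; the JavaScript values below follow the comments in that command.

```shell script
# re-run the code2nl command from the Added model section with, e.g.:
lang=javascript
eval_steps=600    # 400 for ruby, 600 for javascript, 1000 for others
train_steps=30000 # 20000 for ruby, 30000 for javascript, 50000 for others
```
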
#### 4. Training the Diff model
Train the Diff model for each language, using the Added model weights from step 3 as the initial weights. An example invocation is shown after the usage output below.

```shell script
usage: train.py [-h] --model_type MODEL_TYPE --model_name_or_path MODEL_NAME_OR_PATH --output_dir OUTPUT_DIR [--load_model_path LOAD_MODEL_PATH]
                [--train_filename TRAIN_FILENAME] [--dev_filename DEV_FILENAME] [--test_filename TEST_FILENAME] [--config_name CONFIG_NAME] [--tokenizer_name TOKENIZER_NAME]
                [--max_source_length MAX_SOURCE_LENGTH] [--max_target_length MAX_TARGET_LENGTH] [--do_train] [--do_eval] [--do_test] [--do_lower_case] [--no_cuda]
                [--train_batch_size TRAIN_BATCH_SIZE] [--eval_batch_size EVAL_BATCH_SIZE] [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS]
                [--learning_rate LEARNING_RATE] [--beam_size BEAM_SIZE] [--weight_decay WEIGHT_DECAY] [--adam_epsilon ADAM_EPSILON] [--max_grad_norm MAX_GRAD_NORM]
                [--num_train_epochs NUM_TRAIN_EPOCHS] [--max_steps MAX_STEPS] [--eval_steps EVAL_STEPS] [--train_steps TRAIN_STEPS] [--warmup_steps WARMUP_STEPS]
                [--local_rank LOCAL_RANK] [--seed SEED]

optional arguments:
  -h, --help            show this help message and exit
  --model_type MODEL_TYPE
                        Model type: e.g. roberta
  --model_name_or_path MODEL_NAME_OR_PATH
                        Path to pre-trained model: e.g. roberta-base
  --output_dir OUTPUT_DIR
                        The output directory where the model predictions and checkpoints will be written.
  --load_model_path LOAD_MODEL_PATH
                        Path to trained model: Should contain the .bin files
  --train_filename TRAIN_FILENAME
                        The train filename. Should contain the .jsonl files for this task.
  --dev_filename DEV_FILENAME
                        The dev filename. Should contain the .jsonl files for this task.
  --test_filename TEST_FILENAME
                        The test filename. Should contain the .jsonl files for this task.
  --config_name CONFIG_NAME
                        Pretrained config name or path if not the same as model_name
  --tokenizer_name TOKENIZER_NAME
                        Pretrained tokenizer name or path if not the same as model_name
  --max_source_length MAX_SOURCE_LENGTH
                        The maximum total source sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.
  --max_target_length MAX_TARGET_LENGTH
                        The maximum total target sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.
  --do_train            Whether to run training.
  --do_eval             Whether to run eval on the dev set.
  --do_test             Whether to run eval on the dev set.
  --do_lower_case       Set this flag if you are using an uncased model.
  --no_cuda             Avoid using CUDA when available
  --train_batch_size TRAIN_BATCH_SIZE
                        Batch size per GPU/CPU for training.
  --eval_batch_size EVAL_BATCH_SIZE
                        Batch size per GPU/CPU for evaluation.
  --gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS
                        Number of updates steps to accumulate before performing a backward/update pass.
  --learning_rate LEARNING_RATE
                        The initial learning rate for Adam.
  --beam_size BEAM_SIZE
                        beam size for beam search
  --weight_decay WEIGHT_DECAY
                        Weight deay if we apply some.
  --adam_epsilon ADAM_EPSILON
                        Epsilon for Adam optimizer.
  --max_grad_norm MAX_GRAD_NORM
                        Max gradient norm.
  --num_train_epochs NUM_TRAIN_EPOCHS
                        Total number of training epochs to perform.
  --max_steps MAX_STEPS
                        If > 0: set total number of training steps to perform. Override num_train_epochs.
  --eval_steps EVAL_STEPS
  --train_steps TRAIN_STEPS
  --warmup_steps WARMUP_STEPS
                        Linear warmup over warmup_steps.
  --local_rank LOCAL_RANK
                        For distributed training: local_rank
  --seed SEED           random seed for initialization
```
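
For example, a step 4 run for JavaScript might look like the following sketch. The file and directory names are placeholders, the step values follow the per-language comments in the Added model section, and everything else mirrors the Diff model command shown earlier.

```shell script
# example invocation; file and directory names are placeholders
saved_model=pytorch_model.bin # Added model weights produced in step 3

python train.py --do_train --do_eval --model_type roberta \
    --model_name_or_path microsoft/codebert-base \
    --load_model_path $saved_model \
    --train_filename train.jsonl \
    --dev_filename valid.jsonl \
    --output_dir model/javascript \
    --max_source_length 256 \
    --max_target_length 128 \
    --beam_size 10 \
    --train_batch_size 64 \
    --eval_batch_size 64 \
    --learning_rate 5e-5 \
    --train_steps 30000 \
    --eval_steps 600
```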