Image-classification-fulltraining.ipynb 35.6 KB
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# End-to-End Multiclass Image Classification Example\n",
    "\n",
    "1. [Introduction](#Introduction)\n",
    "2. [Prerequisites and Preprocessing](#Prequisites-and-Preprocessing)\n",
    "  1. [Permissions and environment variables](#Permissions-and-environment-variables)\n",
    "3. [Training the ResNet model](#Training-the-ResNet-model)\n",
    "4. [Deploy The Model](#Deploy-the-model)\n",
    "  1. [Create model](#Create-model)\n",
    "  2. [Batch transform](#Batch-transform)\n",
    "  3. [Realtime inference](#Realtime-inference)\n",
    "    1. [Create endpoint configuration](#Create-endpoint-configuration) \n",
    "    2. [Create endpoint](#Create-endpoint) \n",
    "    3. [Perform inference](#Perform-inference) \n",
    "    4. [Clean up](#Clean-up)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "Welcome to our end-to-end example of distributed image classification algorithm. In this demo, we will use the Amazon sagemaker image classification algorithm to train on the [caltech-256 dataset](http://www.vision.caltech.edu/Image_Datasets/Caltech256/). \n",
    "\n",
    "To get started, we need to set up the environment with a few prerequisite steps, for permissions, configurations, and so on."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prequisites and Preprocessing\n",
    "\n",
    "### Permissions and environment variables\n",
    "\n",
    "Here we set up the linkage and authentication to AWS services. There are three parts to this:\n",
    "\n",
    "* The roles used to give learning and hosting access to your data. This will automatically be obtained from the role used to start the notebook\n",
    "* The S3 bucket that you want to use for training and model data\n",
    "* The Amazon sagemaker image classification docker image which need not be changed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "parameters"
    ]
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The method get_image_uri has been renamed in sagemaker>=2.\n",
      "See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.\n",
      "Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 906 ms, sys: 140 ms, total: 1.05 s\n",
      "Wall time: 10.3 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "import boto3\n",
    "import re\n",
    "import sagemaker\n",
    "from sagemaker import get_execution_role\n",
    "from sagemaker.amazon.amazon_estimator import get_image_uri\n",
    "\n",
    "role = get_execution_role()\n",
    "\n",
    "bucket = 'deeplens-sagemaker-yogaproject'\n",
    "\n",
    "training_image = get_image_uri(boto3.Session().region_name, 'image-classification')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data preparation\n",
    "Download the data and transfer to S3 for use in training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os \n",
    "import urllib.request\n",
    "import boto3\n",
    "\n",
    "\n",
    "\n",
    "# caltech-256\n",
    "s3_train_key = \"image-classification-full-training/train\"\n",
    "s3_validation_key = \"image-classification-full-training/validation\"\n",
    "s3_train = 's3://{}/{}/'.format(bucket, s3_train_key)\n",
    "s3_validation = 's3://{}/{}/'.format(bucket, s3_validation_key)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training the ResNet model\n",
    "\n",
    "In this demo, we are using [Caltech-256](http://www.vision.caltech.edu/Image_Datasets/Caltech256/) dataset, which contains 30608 images of 256 objects. For the training and validation data, we follow the splitting scheme in this MXNet [example](https://github.com/apache/incubator-mxnet/blob/master/example/image-classification/data/caltech256.sh). In particular, it randomly selects 60 images per class for training, and uses the remaining data for validation. The algorithm takes `RecordIO` file as input. The user can also provide the image files as input, which will be converted into `RecordIO` format using MXNet's [im2rec](https://mxnet.incubator.apache.org/how_to/recordio.html?highlight=im2rec) tool. It takes around 50 seconds to converted the entire Caltech-256 dataset (~1.2GB) on a p2.xlarge instance. However, for this demo, we will use record io format. \n",
    "\n",
    "Once we have the data available in the correct format for training, the next step is to actually train the model using the data. After setting training parameters, we kick off training, and poll for status until training is completed.\n",
    "\n",
    "## Training parameters\n",
    "There are two kinds of parameters that need to be set for training. The first one are the parameters for the training job. These include:\n",
    "\n",
    "* **Input specification**: These are the training and validation channels that specify the path where training data is present. These are specified in the \"InputDataConfig\" section. The main parameters that need to be set is the \"ContentType\" which can be set to \"rec\" or \"lst\" based on the input data format and the S3Uri which specifies the bucket and the folder where the data is present. \n",
    "* **Output specification**: This is specified in the \"OutputDataConfig\" section. We just need to specify the path where the output can be stored after training\n",
    "* **Resource config**: This section specifies the type of instance on which to run the training and the number of hosts used for training. If \"InstanceCount\" is more than 1, then training can be run in a distributed manner. \n",
    "\n",
    "Apart from the above set of parameters, there are hyperparameters that are specific to the algorithm. These are:\n",
    "\n",
    "* **num_layers**: The number of layers (depth) for the network. We use 101 in this samples but other values such as 50, 152 can be used. \n",
    "* **num_training_samples**: This is the total number of training samples. It is set to 15420 for caltech dataset with the current split\n",
    "* **num_classes**: This is the number of output classes for the new dataset. Imagenet was trained with 1000 output classes but the number of output classes can be changed for fine-tuning. For caltech, we use 257 because it has 256 object categories + 1 clutter class\n",
    "* **epochs**: Number of training epochs\n",
    "* **learning_rate**: Learning rate for training\n",
    "* **mini_batch_size**: The number of training samples used for each mini batch. In distributed training, the number of training samples used per batch will be N * mini_batch_size where N is the number of hosts on which training is run"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After setting training parameters, we kick off training, and poll for status until training is completed, which in this example, takes between 10 to 12 minutes per epoch on a p2.xlarge machine. The network typically converges after 10 epochs. However, to save the training time, we set the epochs to 2 but please keep in mind that it may not be  sufficient to generate a good model. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The algorithm supports multiple network depth (number of layers). They are 18, 34, 50, 101, 152 and 200\n",
    "# For this training, we will use 18 layers\n",
    "num_layers = \"18\" \n",
    "# we need to specify the input image shape for the training data\n",
    "image_shape = \"3,224,224\"\n",
    "# we also need to specify the number of training samples in the training set\n",
    "# for caltech it is 15420\n",
    "num_training_samples = \"600\"\n",
    "# specify the number of output classes\n",
    "num_classes = \"3\"\n",
    "# batch size for training\n",
    "mini_batch_size =  \"64\"\n",
    "# number of epochs\n",
    "epochs = \"2\"\n",
    "# learning rate\n",
    "learning_rate = \"0.01\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Training\n",
    "Run the training using Amazon sagemaker CreateTrainingJob API"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training job name: JOB--2020-11-10-08-44-40\n",
      "\n",
      "Input Data Location: {'S3DataType': 'S3Prefix', 'S3Uri': 's3://deeplens-sagemaker-yogaproject/image-classification-full-training/train/', 'S3DataDistributionType': 'FullyReplicated'}\n",
      "CPU times: user 62.6 ms, sys: 4.15 ms, total: 66.8 ms\n",
      "Wall time: 1.19 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "import time\n",
    "import boto3\n",
    "from time import gmtime, strftime\n",
    "\n",
    "\n",
    "s3 = boto3.client('s3')\n",
    "# create unique job name\n",
    "job_name_prefix = 'JOB'\n",
    "job_name = job_name_prefix + '-' + time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())\n",
    "training_params = \\\n",
    "{\n",
    "    # specify the training docker image\n",
    "    \"AlgorithmSpecification\": {\n",
    "        \"TrainingImage\": training_image,\n",
    "        \"TrainingInputMode\": \"File\"\n",
    "    },\n",
    "    \"RoleArn\": role,\n",
    "    \"OutputDataConfig\": {\n",
    "        \"S3OutputPath\": 's3://{}/{}/output'.format(bucket, job_name_prefix)\n",
    "    },\n",
    "    \"ResourceConfig\": {\n",
    "        \"InstanceCount\": 1,\n",
    "        \"InstanceType\": \"ml.p2.xlarge\",\n",
    "        \"VolumeSizeInGB\": 50\n",
    "    },\n",
    "    \"TrainingJobName\": job_name,\n",
    "    \"HyperParameters\": {\n",
    "        \"image_shape\": image_shape,\n",
    "        \"num_layers\": str(num_layers),\n",
    "        \"num_training_samples\": str(num_training_samples),\n",
    "        \"num_classes\": str(num_classes),\n",
    "        \"mini_batch_size\": str(mini_batch_size),\n",
    "        \"epochs\": str(epochs),\n",
    "        \"learning_rate\": str(learning_rate)\n",
    "    },\n",
    "    \"StoppingCondition\": {\n",
    "        \"MaxRuntimeInSeconds\": 360000\n",
    "    },\n",
    "#Training data should be inside a subdirectory called \"train\"\n",
    "#Validation data should be inside a subdirectory called \"validation\"\n",
    "#The algorithm currently only supports fullyreplicated model (where data is copied onto each machine)\n",
    "    \"InputDataConfig\": [\n",
    "        {\n",
    "            \"ChannelName\": \"train\",\n",
    "            \"DataSource\": {\n",
    "                \"S3DataSource\": {\n",
    "                    \"S3DataType\": \"S3Prefix\",\n",
    "                    \"S3Uri\": s3_train,\n",
    "                    \"S3DataDistributionType\": \"FullyReplicated\"\n",
    "                }\n",
    "            },\n",
    "            \"ContentType\": \"application/x-recordio\",\n",
    "            \"CompressionType\": \"None\"\n",
    "        },\n",
    "        {\n",
    "            \"ChannelName\": \"validation\",\n",
    "            \"DataSource\": {\n",
    "                \"S3DataSource\": {\n",
    "                    \"S3DataType\": \"S3Prefix\",\n",
    "                    \"S3Uri\": s3_validation,\n",
    "                    \"S3DataDistributionType\": \"FullyReplicated\"\n",
    "                }\n",
    "            },\n",
    "            \"ContentType\": \"application/x-recordio\",\n",
    "            \"CompressionType\": \"None\"\n",
    "        }\n",
    "    ]\n",
    "}\n",
    "print('Training job name: {}'.format(job_name))\n",
    "print('\\nInput Data Location: {}'.format(training_params['InputDataConfig'][0]['DataSource']['S3DataSource']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training job current status: InProgress\n",
      "Training job ended with status: Completed\n"
     ]
    }
   ],
   "source": [
    "# create the Amazon SageMaker training job\n",
    "sagemaker = boto3.client(service_name='sagemaker')\n",
    "sagemaker.create_training_job(**training_params)\n",
    "\n",
    "# confirm that the training job has started\n",
    "status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']\n",
    "print('Training job current status: {}'.format(status))\n",
    "\n",
    "try:\n",
    "    # wait for the job to finish and report the ending status\n",
    "    sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)\n",
    "    training_info = sagemaker.describe_training_job(TrainingJobName=job_name)\n",
    "    status = training_info['TrainingJobStatus']\n",
    "    print(\"Training job ended with status: \" + status)\n",
    "except:\n",
    "    print('Training failed to start')\n",
    "     # if exception is raised, that means it has failed\n",
    "    message = sagemaker.describe_training_job(TrainingJobName=job_name)['FailureReason']\n",
    "    print('Training failed with the following error: {}'.format(message))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training job ended with status: Completed\n"
     ]
    }
   ],
   "source": [
    "training_info = sagemaker.describe_training_job(TrainingJobName=job_name)\n",
    "status = training_info['TrainingJobStatus']\n",
    "print(\"Training job ended with status: \" + status)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you see the message,\n",
    "\n",
    "> `Training job ended with status: Completed`\n",
    "\n",
    "then that means training successfully completed and the output model was stored in the output path specified by `training_params['OutputDataConfig']`.\n",
    "\n",
    "You can also view information about and the status of a training job using the AWS SageMaker console. Just click on the \"Jobs\" tab."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Deploy The Model\n",
    "\n",
    "***\n",
    "\n",
    "A trained model does nothing on its own. We now want to use the model to perform inference. For this example, that means predicting the topic mixture representing a given document.\n",
    "\n",
    "This section involves several steps,\n",
    "\n",
    "1. [Create Model](#CreateModel) - Create model for the training output\n",
    "1. [Batch Transform](#BatchTransform) - Create a transform job to perform batch inference.\n",
    "1. [Host the model for realtime inference](#HostTheModel) - Create an inference endpoint and perform realtime inference."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create Model\n",
    "\n",
    "We now create a SageMaker Model from the training output. Using the model we can create a Batch Transform Job or an Endpoint Configuration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The method get_image_uri has been renamed in sagemaker>=2.\n",
      "See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.\n",
      "Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "DEMO-full-image-classification-model-2020-11-10-08-51-55\n",
      "s3://deeplens-sagemaker-yogaproject/JOB/output/JOB--2020-11-10-08-44-40/output/model.tar.gz\n",
      "arn:aws:sagemaker:us-east-1:304659765988:model/demo-full-image-classification-model-2020-11-10-08-51-55\n",
      "CPU times: user 93.2 ms, sys: 9.06 ms, total: 102 ms\n",
      "Wall time: 1.54 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "import boto3\n",
    "from time import gmtime, strftime\n",
    "\n",
    "sage = boto3.Session().client(service_name='sagemaker') \n",
    "\n",
    "model_name=\"DEMO-full-image-classification-model\" + time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())\n",
    "print(model_name)\n",
    "info = sage.describe_training_job(TrainingJobName=job_name)\n",
    "model_data = info['ModelArtifacts']['S3ModelArtifacts']\n",
    "print(model_data)\n",
    "\n",
    "hosting_image = get_image_uri(boto3.Session().region_name, 'image-classification')\n",
    "\n",
    "primary_container = {\n",
    "    'Image': hosting_image,\n",
    "    'ModelDataUrl': model_data,\n",
    "}\n",
    "\n",
    "create_model_response = sage.create_model(\n",
    "    ModelName = model_name,\n",
    "    ExecutionRoleArn = role,\n",
    "    PrimaryContainer = primary_container)\n",
    "\n",
    "print(create_model_response['ModelArn'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Batch transform\n",
    "\n",
    "We now create a SageMaker Batch Transform job using the model created above to perform batch prediction."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Download test data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Download images under /008.bathtub\n",
    "!wget -r -np -nH --cut-dirs=2 -P /tmp/ -R \"index.html*\" http://www.vision.caltech.edu/Image_Datasets/Caltech256/images/008.bathtub/\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "batch_input = 's3://{}/image-classification-full-training/test/'.format(bucket)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Transform job name: image-classification-models-2020-11-10-08-52-35\n",
      "\n",
      "Input Data Location: s3://deeplens-sagemaker-yogaproject/image-classification-full-training/validation/\n"
     ]
    }
   ],
   "source": [
    "timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())\n",
    "batch_job_name = \"image-classification-models\" + timestamp\n",
    "request = \\\n",
    "{\n",
    "    \"TransformJobName\": batch_job_name,\n",
    "    \"ModelName\": model_name,\n",
    "    \"MaxConcurrentTransforms\": 16,\n",
    "    \"MaxPayloadInMB\": 6,\n",
    "    \"BatchStrategy\": \"SingleRecord\",\n",
    "    \"TransformOutput\": {\n",
    "        \"S3OutputPath\": 's3://{}/{}/output'.format(bucket, batch_job_name)\n",
    "    },\n",
    "    \"TransformInput\": {\n",
    "        \"DataSource\": {\n",
    "            \"S3DataSource\": {\n",
    "                \"S3DataType\": \"S3Prefix\",\n",
    "                \"S3Uri\": batch_input\n",
    "            }\n",
    "        },\n",
    "        \"ContentType\": \"application/x-image\",\n",
    "        \"SplitType\": \"None\",\n",
    "        \"CompressionType\": \"None\"\n",
    "    },\n",
    "    \"TransformResources\": {\n",
    "            \"InstanceType\": \"ml.c5.xlarge\",\n",
    "            \"InstanceCount\": 1\n",
    "    }\n",
    "}\n",
    "\n",
    "print('Transform job name: {}'.format(batch_job_name))\n",
    "print('\\nInput Data Location: {}'.format(s3_validation))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Created Transform job with name:  image-classification-models-2020-11-10-08-52-35\n",
      "Transform job ended with status: Completed\n"
     ]
    }
   ],
   "source": [
    "sagemaker = boto3.client('sagemaker')\n",
    "sagemaker.create_transform_job(**request)\n",
    "\n",
    "print(\"Created Transform job with name: \", batch_job_name)\n",
    "\n",
    "while(True):\n",
    "    response = sagemaker.describe_transform_job(TransformJobName=batch_job_name)\n",
    "    status = response['TransformJobStatus']\n",
    "    if status == 'Completed':\n",
    "        print(\"Transform job ended with status: \" + status)\n",
    "        break\n",
    "    if status == 'Failed':\n",
    "        message = response['FailureReason']\n",
    "        print('Transform failed with the following error: {}'.format(message))\n",
    "        raise Exception('Transform job failed') \n",
    "    time.sleep(30)   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After the job completes, let's inspect the prediction results. The accuracy may not be quite good because we set the epochs to 2 during training which may not be sufficient to train a good model. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sample inputs: ['image-classification-full-training/test/', 'image-classification-full-training/test/1.one-minute-yoga-at-home-tree-pose-video-beginner.jpg']\n",
      "Sample output: ['image-classification-models-2020-11-10-08-52-35/output/1.one-minute-yoga-at-home-tree-pose-video-beginner.jpg.out', 'image-classification-models-2020-11-10-08-52-35/output/10.maxresdefault.jpg.out']\n",
      "Result: label - plank, probability - 0.9452986717224121\n",
      "Result: label - plank, probability - 0.5059212446212769\n",
      "Result: label - plank, probability - 0.908556342124939\n",
      "Result: label - plank, probability - 0.9969107508659363\n",
      "Result: label - tree, probability - 0.7288743853569031\n",
      "Result: label - plank, probability - 0.589160680770874\n",
      "Result: label - tree, probability - 0.7094720602035522\n",
      "Result: label - tree, probability - 0.6348884105682373\n",
      "Result: label - tree, probability - 0.9513751864433289\n",
      "Result: label - tree, probability - 0.9830145835876465\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[('plank', 0.9452986717224121),\n",
       " ('plank', 0.5059212446212769),\n",
       " ('plank', 0.908556342124939),\n",
       " ('plank', 0.9969107508659363),\n",
       " ('tree', 0.7288743853569031),\n",
       " ('plank', 0.589160680770874),\n",
       " ('tree', 0.7094720602035522),\n",
       " ('tree', 0.6348884105682373),\n",
       " ('tree', 0.9513751864433289),\n",
       " ('tree', 0.9830145835876465)]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from urllib.parse import urlparse\n",
    "import json\n",
    "import numpy as np\n",
    "\n",
    "s3_client = boto3.client('s3')\n",
    "object_categories = ['tree', 'plank']\n",
    "def list_objects(s3_client, bucket, prefix):\n",
    "    response = s3_client.list_objects(Bucket=bucket, Prefix=prefix)\n",
    "    objects = [content['Key'] for content in response['Contents']]\n",
    "    return objects\n",
    "\n",
    "def get_label(s3_client, bucket, prefix):\n",
    "    filename = prefix.split('/')[-1]\n",
    "    s3_client.download_file(bucket, prefix, filename)\n",
    "    with open(filename) as f:\n",
    "        data = json.load(f)\n",
    "        index = np.argmax(data['prediction'])\n",
    "        probability = data['prediction'][index]\n",
    "    print(\"Result: label - \" + object_categories[index] + \", probability - \" + str(probability))\n",
    "    return object_categories[index], probability\n",
    "\n",
    "inputs = list_objects(s3_client, bucket, urlparse(batch_input).path.lstrip('/'))\n",
    "print(\"Sample inputs: \" + str(inputs[:2]))\n",
    "\n",
    "outputs = list_objects(s3_client, bucket, batch_job_name + \"/output\")\n",
    "print(\"Sample output: \" + str(outputs[:2]))\n",
    "\n",
    "# Check prediction result of the first 2 images\n",
    "[get_label(s3_client, bucket, prefix) for prefix in outputs[0:10]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Realtime inference\n",
    "\n",
    "We now host the model with an endpoint and perform realtime inference.\n",
    "\n",
    "This section involves several steps,\n",
    "1. [Create endpoint configuration](#CreateEndpointConfiguration) - Create a configuration defining an endpoint.\n",
    "1. [Create endpoint](#CreateEndpoint) - Use the configuration to create an inference endpoint.\n",
    "1. [Perform inference](#PerformInference) - Perform inference on some input data using the endpoint.\n",
    "1. [Clean up](#CleanUp) - Delete the endpoint and model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create Endpoint Configuration\n",
    "At launch, we will support configuring REST endpoints in hosting with multiple models, e.g. for A/B testing purposes. In order to support this, customers create an endpoint configuration, that describes the distribution of traffic across the models, whether split, shadowed, or sampled in some way.\n",
    "\n",
    "In addition, the endpoint configuration describes the instance type required for model deployment, and at launch will describe the autoscaling configuration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from time import gmtime, strftime\n",
    "\n",
    "timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())\n",
    "endpoint_config_name = job_name_prefix + '-epc-' + timestamp\n",
    "endpoint_config_response = sage.create_endpoint_config(\n",
    "    EndpointConfigName = endpoint_config_name,\n",
    "    ProductionVariants=[{\n",
    "        'InstanceType':'ml.m4.xlarge',\n",
    "        'InitialInstanceCount':1,\n",
    "        'ModelName':model_name,\n",
    "        'VariantName':'AllTraffic'}])\n",
    "\n",
    "print('Endpoint configuration name: {}'.format(endpoint_config_name))\n",
    "print('Endpoint configuration arn:  {}'.format(endpoint_config_response['EndpointConfigArn']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create Endpoint\n",
    "Next, the customer creates the endpoint that serves up the model, through specifying the name and configuration defined above. The end result is an endpoint that can be validated and incorporated into production applications. This takes 9-11 minutes to complete."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%%time\n",
    "import time\n",
    "\n",
    "timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())\n",
    "endpoint_name = job_name_prefix + '-ep-' + timestamp\n",
    "print('Endpoint name: {}'.format(endpoint_name))\n",
    "\n",
    "endpoint_params = {\n",
    "    'EndpointName': endpoint_name,\n",
    "    'EndpointConfigName': endpoint_config_name,\n",
    "}\n",
    "endpoint_response = sagemaker.create_endpoint(**endpoint_params)\n",
    "print('EndpointArn = {}'.format(endpoint_response['EndpointArn']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now the endpoint can be created. It may take sometime to create the endpoint..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# get the status of the endpoint\n",
    "response = sagemaker.describe_endpoint(EndpointName=endpoint_name)\n",
    "status = response['EndpointStatus']\n",
    "print('EndpointStatus = {}'.format(status))\n",
    "\n",
    "\n",
    "# wait until the status has changed\n",
    "sagemaker.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)\n",
    "\n",
    "\n",
    "# print the status of the endpoint\n",
    "endpoint_response = sagemaker.describe_endpoint(EndpointName=endpoint_name)\n",
    "status = endpoint_response['EndpointStatus']\n",
    "print('Endpoint creation ended with EndpointStatus = {}'.format(status))\n",
    "\n",
    "if status != 'InService':\n",
    "    raise Exception('Endpoint creation failed.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you see the message,\n",
    "\n",
    "> `Endpoint creation ended with EndpointStatus = InService`\n",
    "\n",
    "then congratulations! You now have a functioning inference endpoint. You can confirm the endpoint configuration and status by navigating to the \"Endpoints\" tab in the AWS SageMaker console.\n",
    "\n",
    "We will finally create a runtime object from which we can invoke the endpoint."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Perform Inference\n",
    "Finally, the customer can now validate the model for use. They can obtain the endpoint from the client library using the result from previous operations, and generate classifications from the trained model using that endpoint.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import boto3\n",
    "runtime = boto3.Session().client(service_name='runtime.sagemaker') "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Download test image"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!wget -O /tmp/test.jpg http://www.vision.caltech.edu/Image_Datasets/Caltech256/images/008.bathtub/008_0007.jpg\n",
    "file_name = '/tmp/test.jpg'\n",
    "# test image\n",
    "from IPython.display import Image\n",
    "Image(file_name)  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Evaluation\n",
    "\n",
    "Evaluate the image through the network for inteference. The network outputs class probabilities and typically, one selects the class with the maximum probability as the final class output.\n",
    "\n",
    "**Note:** The output class detected by the network may not be accurate in this example. To limit the time taken and cost of training, we have trained the model only for a couple of epochs. If the network is trained for more epochs (say 20), then the output class will be more accurate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import json\n",
    "import numpy as np\n",
    "\n",
    "with open(file_name, 'rb') as f:\n",
    "    payload = f.read()\n",
    "    payload = bytearray(payload)\n",
    "response = runtime.invoke_endpoint(EndpointName=endpoint_name, \n",
    "                                   ContentType='application/x-image', \n",
    "                                   Body=payload)\n",
    "result = response['Body'].read()\n",
    "# result will be in json format and convert it to ndarray\n",
    "result = json.loads(result)\n",
    "# the result will output the probabilities for all classes\n",
    "# find the class with maximum probability and print the class index\n",
    "index = np.argmax(result)\n",
    "object_categories = ['ak47', 'american-flag', 'backpack', 'baseball-bat', 'baseball-glove', 'basketball-hoop', 'bat', 'bathtub', 'bear', 'beer-mug', 'billiards', 'binoculars', 'birdbath', 'blimp', 'bonsai-101', 'boom-box', 'bowling-ball', 'bowling-pin', 'boxing-glove', 'brain-101', 'breadmaker', 'buddha-101', 'bulldozer', 'butterfly', 'cactus', 'cake', 'calculator', 'camel', 'cannon', 'canoe', 'car-tire', 'cartman', 'cd', 'centipede', 'cereal-box', 'chandelier-101', 'chess-board', 'chimp', 'chopsticks', 'cockroach', 'coffee-mug', 'coffin', 'coin', 'comet', 'computer-keyboard', 'computer-monitor', 'computer-mouse', 'conch', 'cormorant', 'covered-wagon', 'cowboy-hat', 'crab-101', 'desk-globe', 'diamond-ring', 'dice', 'dog', 'dolphin-101', 'doorknob', 'drinking-straw', 'duck', 'dumb-bell', 'eiffel-tower', 'electric-guitar-101', 'elephant-101', 'elk', 'ewer-101', 'eyeglasses', 'fern', 'fighter-jet', 'fire-extinguisher', 'fire-hydrant', 'fire-truck', 'fireworks', 'flashlight', 'floppy-disk', 'football-helmet', 'french-horn', 'fried-egg', 'frisbee', 'frog', 'frying-pan', 'galaxy', 'gas-pump', 'giraffe', 'goat', 'golden-gate-bridge', 'goldfish', 'golf-ball', 'goose', 'gorilla', 'grand-piano-101', 'grapes', 'grasshopper', 'guitar-pick', 'hamburger', 'hammock', 'harmonica', 'harp', 'harpsichord', 'hawksbill-101', 'head-phones', 'helicopter-101', 'hibiscus', 'homer-simpson', 'horse', 'horseshoe-crab', 'hot-air-balloon', 'hot-dog', 'hot-tub', 'hourglass', 'house-fly', 'human-skeleton', 'hummingbird', 'ibis-101', 'ice-cream-cone', 'iguana', 'ipod', 'iris', 'jesus-christ', 'joy-stick', 'kangaroo-101', 'kayak', 'ketch-101', 'killer-whale', 'knife', 'ladder', 'laptop-101', 'lathe', 'leopards-101', 'license-plate', 'lightbulb', 'light-house', 'lightning', 'llama-101', 'mailbox', 'mandolin', 'mars', 'mattress', 'megaphone', 'menorah-101', 'microscope', 'microwave', 'minaret', 'minotaur', 'motorbikes-101', 'mountain-bike', 'mushroom', 'mussels', 'necktie', 'octopus', 'ostrich', 'owl', 'palm-pilot', 'palm-tree', 'paperclip', 'paper-shredder', 'pci-card', 'penguin', 'people', 'pez-dispenser', 'photocopier', 'picnic-table', 'playing-card', 'porcupine', 'pram', 'praying-mantis', 'pyramid', 'raccoon', 'radio-telescope', 'rainbow', 'refrigerator', 'revolver-101', 'rifle', 'rotary-phone', 'roulette-wheel', 'saddle', 'saturn', 'school-bus', 'scorpion-101', 'screwdriver', 'segway', 'self-propelled-lawn-mower', 'sextant', 'sheet-music', 'skateboard', 'skunk', 'skyscraper', 'smokestack', 'snail', 'snake', 'sneaker', 'snowmobile', 'soccer-ball', 'socks', 'soda-can', 'spaghetti', 'speed-boat', 'spider', 'spoon', 'stained-glass', 'starfish-101', 'steering-wheel', 'stirrups', 'sunflower-101', 'superman', 'sushi', 'swan', 'swiss-army-knife', 'sword', 'syringe', 'tambourine', 'teapot', 'teddy-bear', 'teepee', 'telephone-box', 'tennis-ball', 'tennis-court', 'tennis-racket', 'theodolite', 'toaster', 'tomato', 'tombstone', 'top-hat', 'touring-bike', 'tower-pisa', 'traffic-light', 'treadmill', 'triceratops', 'tricycle', 'trilobite-101', 'tripod', 't-shirt', 'tuning-fork', 'tweezer', 'umbrella-101', 'unicorn', 'vcr', 'video-projector', 'washing-machine', 'watch-101', 'waterfall', 'watermelon', 'welding-mask', 'wheelbarrow', 'windmill', 'wine-bottle', 'xylophone', 'yarmulke', 'yo-yo', 'zebra', 'airplanes-101', 'car-side-101', 'faces-easy-101', 'greyhound', 'tennis-shoes', 'toad', 'clutter']\n",
    "print(\"Result: label - \" + object_categories[index] + \", probability - \" + str(result[index]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Clean up\n",
    "\n",
    "When we're done with the endpoint, we can just delete it and the backing instances will be released.  Run the following cell to delete the endpoint."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sage.delete_endpoint(EndpointName=endpoint_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "conda_mxnet_p36",
   "language": "python",
   "name": "conda_mxnet_p36"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}