In today’s world, Artificial Intelligence (AI) is widely used across a plethora of domains such as e-commerce, automobile engineering, medicine, smart farming, electronics, and cybersecurity. Model serving refers to hosting AI and machine learning (ML) models on the cloud or on-premises and exposing their functionality via application programming interfaces (APIs) that can be integrated into systems to make them AI-enabled. MLOps (Machine Learning Operations) is the set of practices focused on operationalizing, deploying, and maintaining machine learning models in production.

There are several model-serving frameworks in use. Some examples include KServe [1], Seldon Core [2], and BentoML [3]. In this blog, we will learn how to serve and deploy Akaike’s Detectron models (object detection models) using TorchServe [4], a model-serving framework for PyTorch [5], which is an open-source ML framework based on the Python programming language and the Torch library (an open-source ML library) [6]. We will use TorchServe to serve the trained transfer-learning model and outline the steps to follow before and after serving the model.

Using TorchServe

Prior to TorchServe, Akaike had its own custom model-serving solution for PyTorch models. This required custom handlers for the models, a model server, a Docker container, a mechanism for accessibility over the network, and integration with the cluster orchestration system. With the launch of TorchServe in 2020, Akaike shifted to using it to serve its PyTorch models. TorchServe facilitates the deployment of PyTorch models in a performant manner, at scale, without the need for custom serving infrastructure. A few lines of code are all you need to move from a trained model to production deployment!

Figure 1 shows the architecture of the TorchServe model serving framework.


Figure 1: TorchServe Architecture


Deploying a Model on TorchServe

To deploy a model, we need to do the following:

  • Create a MAR (Model ARchive) file for the model to be deployed, using torch-model-archiver. The MAR file should include the initializing, pre-processing, inference, and post-processing steps, along with the model.
  • Once the MAR files of all your models are created, serve them using the torchserve CLI (TorchServe command-line interface).
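As a sketch, the archiving step might look like the command below. The file names (model_final.pth, detectron_handler.py) are hypothetical placeholders, not taken from the original post; substitute your own trained weights and handler script.

```shell
# Hypothetical torch-model-archiver invocation. The handler file supplies the
# initializing, pre-processing, inference, and post-processing steps mentioned
# above; the resulting detectron-model.mar lands in the model_store folder.
torch-model-archiver \
  --model-name detectron-model \
  --version 1.0 \
  --serialized-file model_final.pth \
  --handler detectron_handler.py \
  --export-path model_store
```

Run this on a machine where TorchServe and its tooling are installed (`pip install torchserve torch-model-archiver`).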

Deploying the Detectron Model

To deploy the Detectron model, we first need to create its MAR file.

If you are new to Detectron, please use this link to gain a better understanding. Follow this notebook for creating a MAR file for Detectron.
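The handler packaged into the MAR file broadly implements four stages. The sketch below is a simplified, dependency-free illustration of that structure; a real TorchServe handler subclasses `ts.torch_handler.base_handler.BaseHandler` and wraps a Detectron2 predictor, so treat the class and method bodies here as assumptions for illustration only.

```python
# Simplified sketch of the four handler stages TorchServe expects:
# initialize, preprocess, inference, postprocess. The model is injected as
# any callable so the structure is visible without heavy dependencies.
import io


class DetectronHandlerSketch:
    def __init__(self):
        self.model = None

    def initialize(self, model):
        # In a real handler, this loads weights from the MAR file's context.
        self.model = model

    def preprocess(self, data):
        # TorchServe passes a list of requests; each carries raw bytes
        # under the "data" or "body" key.
        images = []
        for row in data:
            payload = row.get("data") or row.get("body")
            images.append(io.BytesIO(payload).read())
        return images

    def inference(self, images):
        # Run the model on each decoded input.
        return [self.model(img) for img in images]

    def postprocess(self, outputs):
        # Return one JSON-serializable result per request.
        return [{"prediction": out} for out in outputs]
```

TorchServe calls these stages in order for every request batch, which is why the MAR file must bundle all four alongside the model weights.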

Deploying on Docker Compose

To deploy on Docker Compose, create a docker-compose.yml file:

```yaml
version: '3.8'
services:
  torchserve:
    build: .
    ports:
      - 9095:8085
    command: torchserve --start --foreground --model-store model_store --models my-model-1=my-model-1.mar my-model-2=my-model-2.mar my-model-3=my-model-3.mar --ts-config
    restart: always
    volumes:
      - ./model_store:/home/model_store
```

Content of docker-compose.yml

Typically, if you are using Celery, you would have multiple services in docker-compose, such as Flower, Redis, a web service, and a worker. Please see this to learn more about dockerizing Celery.

Explanation of fields in docker-compose.yml:

  • ports:
      - 9095:8085
    Here, 8085 is the TorchServe port used for model inference inside the container, and 9095 is a VM port. The two are mapped, so a client can use the inference API by hitting port 9095 of the VM.
  • command: torchserve --start --foreground --model-store model_store --models my-model-1=my-model-1.mar my-model-2=my-model-2.mar my-model-3=my-model-3.mar --ts-config
      - --foreground keeps TorchServe running in the foreground so the service stays active.
      - --model-store [path of model store folder] gives the path of the model store.
      - --models [model name]=[MAR file of model] lets you host multiple models.
      - --ts-config [path to config file] lets you change the TorchServe port or make other changes to the configuration of the hosting service.
  • Content of the configuration file passed via --ts-config:
      inference_address=http://0.0.0.0:8085
      default_workers_per_model=1
    Here, the inference address is changed to port 8085 and the number of workers per model is set to 1.

Testing the Model Serving

Use the steps below to execute and test your model.

  • After creating the MAR file, you need to run the following command: "torchserve --start --model-store model_store --models detection-model=detectron-model.mar"

Figure 2: Code to convert image to binary
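The figure's code is not reproduced here, but a minimal client along those lines might look like the sketch below. The URL (VM port 9095, model name detection-model) follows the port mapping and model registration shown earlier; both, along with the function names, are assumptions to adapt to your deployment.

```python
# Client sketch: read an image file as raw binary and POST it to the
# TorchServe inference endpoint (/predictions/<model-name>).
from urllib import request


def load_image_bytes(image_path):
    # Read the image exactly as stored on disk (binary mode).
    with open(image_path, "rb") as f:
        return f.read()


def predict(image_path, url="http://localhost:9095/predictions/detection-model"):
    # Send the raw bytes as the request body and return the raw response.
    payload = load_image_bytes(image_path)
    req = request.Request(url, data=payload, method="POST")
    with request.urlopen(req) as resp:
        return resp.read()
```

Calling `predict("sample.jpg")` against a running TorchServe instance returns the detection results produced by the handler's postprocess step.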

Benefits of TorchServe

Deploying models through a custom model server requires converting them to appropriate formats, which is time-consuming and burdensome. Using TorchServe, we can simplify model deployment using a single servable file that also serves as the single source of truth and is easy to share and manage. To summarise, it comes with the following advantages:

  • Simultaneous and seamless handling of multiple requests
  • Compatibility with both CPU and GPU systems
  • Easy to use in production (results can be obtained through a single API call)
  • Performant and scalable (the number of workers can be configured through a management API)
  • Loading and serving multiple models in parallel
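For instance, worker counts can be adjusted at runtime through TorchServe's management API, which listens on port 8081 by default. The model name below assumes the model was registered as detection-model.

```shell
# Scale the number of workers for a registered model via the management API.
# min_worker/max_worker set the lower and upper bounds TorchServe maintains.
curl -X PUT "http://localhost:8081/models/detection-model?min_worker=2&max_worker=4"
```

This requires the management port to be reachable from wherever the command is run, which may mean adding an 8081 mapping to the docker-compose ports section.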


In this blog, we saw how TorchServe can be used to serve PyTorch models in an easy and consistent manner, using the example of a Detectron model. Being fully integrated with PyTorch, it is the recommended framework for serving PyTorch-based ML models in production environments. Both developers and operations engineers can leverage TorchServe to prepare models for production. With TorchServe, we can deploy models in a performant and scalable manner, serving both eager-mode and TorchScript models without cumbersome code modification.


  1. KServe – Model serving using KServe
  2. Seldon Core
  3. BentoML: A Unified Model Serving Framework

Edited by: Naga Vydyanathan