Paddle Serving: model-as-a-service! Triggered by a single command line, deployment finishes in 10 minutes

To bridge the gap between Paddle Serving and the PaddlePaddle framework, we have released a new Paddle Serving capability on GitHub: Model-as-a-Service (MAAS). With it, once a PaddlePaddle model is trained, users obtain the corresponding inference model at the same time, so the deep learning inference service can be deployed online for any application. Paddle Serving has the following four key features:

Easy to use: To reduce development costs for PaddlePaddle users, Paddle Serving provides APIs that connect directly with PaddlePaddle. It integrates seamlessly with the Paddle training pipeline, and most Paddle models can be deployed with a single command.

Industrial-grade service: To meet industrial requirements, Paddle Serving supports large-scale features, including distributed sparse parameter indexing, highly concurrent communication, model management, online loading, and online A/B testing.

Extensible framework design: Paddle Serving currently supports C++, Python, and Golang clients, and clients for more programming languages will be supported in the future. Although the current version focuses mainly on the PaddlePaddle framework, it is also easy for users to deploy other machine learning libraries or frameworks as online inference services.

Backed by a high-performance inference engine: Paddle Serving currently supports only the native Paddle Inference library as its backend inference engine. It offers the following high-performance features: memory and GPU memory reuse, automatic operator fusion, and TensorRT and Paddle Lite subgraph calls. Figure 1 shows the path of a request from the client to computation on the server in Paddle Serving. Throughout this process, the underlying communication is handled by Baidu-RPC, which provides high concurrency and low latency.


Figure 1: Paddle Serving workflow.

Model-as-a-service: inference service triggered with a single command line

MAAS (Model-as-a-Service) means that once a model is trained, it can be deployed directly as an online inference service. PaddlePaddle provides an API that saves the trained model in a specific format, which Paddle Serving can then deploy with a single command. We present a tutorial here.

In this tutorial, we use a house price prediction model as the example. The following commands download and uncompress the trained model and save the corresponding configuration files into the uci_housing folder.

wget --no-check-certificate
tar -xzf uci_housing.tar.gz
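The extraction step can also be scripted. The following is a minimal sketch, assuming the archive has already been downloaded as uci_housing.tar.gz; the helper name extract_model is hypothetical and not part of Paddle Serving. Only extract archives from a source you trust.

```python
import tarfile
from pathlib import Path, PurePosixPath


def extract_model(archive_path, dest_dir="."):
    """Extract a .tar.gz archive and return its top-level entry names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter when this Python version has it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest, **kwargs)
    # Collect the distinct top-level directories or files in the archive.
    top_level = set()
    for name in names:
        parts = PurePosixPath(name).parts
        if parts:
            top_level.add(parts[0])
    return sorted(top_level)


# Example call (requires the downloaded archive):
# print(extract_model("uci_housing.tar.gz"))
```

Listing the top-level entries is a quick way to confirm which directory name to pass to the server later.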

The paddle_serving_server module of Paddle Serving must be installed on the server side. If it is not installed, install it with one of the following commands, depending on your hardware:

pip install paddle_serving_server  # install the CPU version of paddle_serving_server
pip install paddle_serving_server_gpu  # install the GPU version of paddle_serving_server
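To find out whether either package is already present before installing, a check like the following can help. It is a minimal sketch using only the standard library; the helper name is_installed is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


for name in ("paddle_serving_server", "paddle_serving_server_gpu"):
    status = "installed" if is_installed(name) else "not installed"
    print(f"{name}: {status}")
```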

Now it is time to start the inference service!

python -m paddle_serving_server.serve --model uci_housing_model/ --thread 10 --port 9292 --name uci

model: Directory containing the server-side config and model files.
thread: Number of threads.
port: Port number.
name: Name of the HTTP inference service. If not specified, an RPC service is started.

If you see the following output, the service has started successfully.
* Running on (Press CTRL+C to quit)
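The effect of the name option can be illustrated by assembling the same command in Python. This is a sketch only: build_serve_command is a hypothetical helper, and it uses just the module name and the four options shown in the command above.

```python
import sys


def build_serve_command(model_dir, port, threads, name=None):
    """Build the argument list for the serve command used in this tutorial."""
    cmd = [sys.executable, "-m", "paddle_serving_server.serve",
           "--model", model_dir,
           "--thread", str(threads),
           "--port", str(port)]
    if name is not None:
        # With --name the service is exposed over HTTP; without it, over RPC.
        cmd += ["--name", name]
    return cmd


http_cmd = build_serve_command("uci_housing_model/", 9292, 10, name="uci")
rpc_cmd = build_serve_command("uci_housing_model/", 9292, 10)
print(" ".join(http_cmd[1:]))
print(" ".join(rpc_cmd[1:]))
```

Either list could be handed to subprocess.Popen to launch the server from a script, provided paddle_serving_server is installed.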

Once the service is running, it listens on the port specified above (9292 in this tutorial). To request a prediction, send the data to the server over HTTP in the required format. After the server finishes the computation, the predicted house price is returned to the user.

curl -H "Content-Type:application/json" -X POST -d '{"x": [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332], "fetch":["price"]}'
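The same request can be built in Python. This is a minimal sketch using only the standard library; build_prediction_request is a hypothetical helper and SERVICE_URL is a placeholder that must be replaced by the address of your running service.

```python
import json
import urllib.request


def build_prediction_request(url, features, fetch=("price",)):
    """Build a POST request with the same JSON body as the curl example."""
    body = json.dumps({"x": list(features), "fetch": list(fetch)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The 13 input features from the curl example.
features = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583,
            -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]

# Placeholder only: substitute the prediction URL of your own service.
SERVICE_URL = "http://127.0.0.1:9292/your/service/path"
request = build_prediction_request(SERVICE_URL, features)

# Sending is left commented out because it needs a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```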

Saving trained models:

The method described above works for general models that do not involve complicated computation or data preprocessing. For models that do require preprocessing, Paddle Serving offers other easy methods.

In that case, a client is needed to preprocess the data before it is sent to the server. With only a few additional steps, the whole service can still be deployed in 10 minutes. Below, we use Bert As Service as an example.

Building Bert As A Service in 10 minutes

Bert As Service is a popular library in the GitHub community that returns the semantic embedding of a given input sentence. The Bert model has attracted NLP researchers because of its good performance on many public NLP tasks, and using Bert embeddings as the input to an NLP task can improve performance significantly. Now, with the help of Paddle Serving, you need only four steps to deploy the same service!

1. Saving Servable Models

Paddle Serving supports any model trained with PaddlePaddle. A trained model is saved for serving by specifying its input and output variables. For illustration, we load a trained Chinese Bert model, bert_chinese_L-12_H-768_A-12, from PaddleHub and run the following script to save the model and configuration files for later use. The server and client configuration files are saved in bert_seq20_model and bert_seq20_client, respectively.

import paddlehub as hub
model_name = "bert_chinese_L-12_H-768_A-12"

# obtain model files
module = hub.Module(model_name)
# get input and output information, as well as the program
inputs, outputs, program = module.context(
    trainable=True, max_seq_len=20)

# map the names of input and output
feed_keys = ["input_ids", "position_ids", "segment_ids",
             "input_mask"]
fetch_keys = ["pooled_output", "sequence_output"]
feed_dict = dict(zip(feed_keys, [inputs[x] for x in feed_keys]))
fetch_dict = dict(zip(fetch_keys, [outputs[x] for x in fetch_keys]))

# save the model and config files for serving
# param1: server model directory, param2: client config directory,
# param3: input dict, param4: output dict, param5: model program
import paddle_serving_client.io as serving_io
serving_io.save_model("bert_seq20_model", "bert_seq20_client",
                      feed_dict, fetch_dict, program)
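The feed and fetch dictionaries simply select the named variables that the service should expose. The sketch below shows the mapping on stand-in data so it can be run without PaddleHub; the string values and the extra_input key are hypothetical.

```python
# Stand-ins for the dictionary returned by module.context(). In the real
# script the values are framework variables; strings are used here only so
# the mapping can be inspected without PaddleHub installed.
inputs = {
    "input_ids": "<var input_ids>",
    "position_ids": "<var position_ids>",
    "segment_ids": "<var segment_ids>",
    "input_mask": "<var input_mask>",
    "extra_input": "<var not served>",  # hypothetical key, not selected
}
feed_keys = ["input_ids", "position_ids", "segment_ids", "input_mask"]

# Same construction as in the save script.
feed_dict = dict(zip(feed_keys, [inputs[x] for x in feed_keys]))

# Equivalent, more direct spelling.
feed_dict_alt = {key: inputs[key] for key in feed_keys}

print(list(feed_dict))
```

Only the keys listed in feed_keys end up in the dictionary, so any other variable of the model is not exposed as a service input.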

2. Start the server

Run the following command to start the server. The gpu_ids option specifies the ID of the GPU to use.

python -m paddle_serving_server_gpu.serve --model bert_seq20_model --thread 10 --port 9292 --gpu_ids 0

If the server starts successfully, you will see the following message.

Server[baidu::paddle_serving::predictor::general_model::GeneralModelServiceImpl] is serving on port=9292.
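Besides reading the log message, one can check from a script that the port accepts connections. This is a minimal sketch, assuming the server runs on the local machine; is_port_open is a hypothetical helper, and an open port only shows that something is listening, not that the model loaded correctly.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# 9292 is the port passed to --port; "127.0.0.1" assumes a local server.
print("server reachable:", is_port_open("127.0.0.1", 9292))
```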

3. Client data preprocessing configuration

Paddle Serving provides preprocessing modules for a variety of public datasets. For the Bert tutorial, we use the ChineseBertReader class from paddle_serving_app, which prepares a given Chinese sentence so that its semantic embedding can be obtained. Run the following command to install paddle_serving_app:

pip install paddle_serving_app
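To give an idea of what such preprocessing involves, here is a toy sketch that turns a sentence into the four fixed-length inputs the saved model expects. It is not the implementation of ChineseBertReader: the character-level vocabulary, the special token IDs, and the function name encode_sentence are all assumptions made for illustration.

```python
def encode_sentence(sentence, vocab, max_seq_len=20,
                    cls_id=1, sep_id=2, pad_id=0, unk_id=3):
    """Turn a sentence into fixed-length BERT-style inputs (toy version)."""
    # Reserve two positions for the start and end markers.
    chars = list(sentence.strip())[: max_seq_len - 2]
    ids = [cls_id] + [vocab.get(ch, unk_id) for ch in chars] + [sep_id]
    real_len = len(ids)
    padding = max_seq_len - real_len
    return {
        "input_ids": ids + [pad_id] * padding,
        "position_ids": list(range(real_len)) + [0] * padding,
        "segment_ids": [0] * max_seq_len,
        # 1.0 marks real tokens, 0.0 marks padding.
        "input_mask": [1.0] * real_len + [0.0] * padding,
    }


toy_vocab = {"你": 10, "好": 11}  # hypothetical two-character vocabulary
feed = encode_sentence("你好", toy_vocab)
print(feed["input_ids"])
```

Every field has length max_seq_len, which matches the max_seq_len=20 used when the model was saved.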

4. Configure the client to access the server to obtain inference results

import os
import sys
from paddle_serving_client import Client
from paddle_serving_app import ChineseBertReader

# define reader for data preprocessing
reader = ChineseBertReader()
# specify the inference results
fetch = ["pooled_output"]
# specify the server address as "ip:port" (the server above listens on port 9292)
endpoint_list = [""]
# client class
client = Client()

# load client config file
# (assumed path: the config file that save_model wrote into bert_seq20_client)
client.load_client_config("bert_seq20_client/serving_client_conf.prototxt")
# connect to the server
client.connect(endpoint_list)

# load the data, send it to the server, and print the results
for line in sys.stdin:
    feed_dict = reader.process(line)
    result = client.predict(feed=feed_dict, fetch=fetch)
    print(result)

Now put the Chinese sentences into a text file, say data.txt, one sentence per line. Then run the script to get the inference results.

cat data.txt | python

More examples are also available at

Increased throughput

Throughput is one of the fundamental metrics for the quality of an online service. We tested Bert As Service using Paddle Serving on 4 NVIDIA Tesla V100 GPUs. Figure 2 shows the comparison at the same batch size and number of threads. Throughput increases significantly with Paddle Serving; at a batch size of 128, it is about 58.3% higher.


Figure 2: Paddle Serving throughput test.
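A relative throughput increase such as the one quoted above is computed from two measurements taken under the same settings. The sketch below shows the arithmetic; the two throughput values are hypothetical numbers chosen only to illustrate the formula and are not the measurements behind Figure 2.

```python
def relative_gain_percent(baseline, measured):
    """Percentage by which measured throughput exceeds the baseline."""
    return (measured - baseline) / baseline * 100.0


# Hypothetical throughputs (samples per second) at one batch size.
baseline_throughput = 1000.0
measured_throughput = 1583.0

gain = relative_gain_percent(baseline_throughput, measured_throughput)
print(f"{gain:.1f}%")  # prints 58.3%
```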

In addition, Paddle Serving also supports inference services for a variety of other tasks across different domains.

Timeline: a performance visualization tool

Paddle Serving provides Timeline, a performance visualization tool that shows what happens in the backend once the clients are started. As an example, Figure 3 shows the visualization for Bert As Service, where bert_pre indicates the preprocessing stage and client_infer indicates the stage of sending requests and receiving results. The second row of each process shows the timeline elements reported by the server.

Figure 3: Timeline, a visualization tool for model performance.
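The idea of per-stage timing can be sketched with a few lines of Python. This is not how the Timeline tool is implemented; it is a minimal illustration in which the stage names are taken from the figure description and the sleep calls stand in for real work.

```python
import time
from contextlib import contextmanager

events = []


@contextmanager
def stage(name):
    """Record the start and end time of a named stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        events.append({"name": name, "start": start,
                       "end": time.perf_counter()})


with stage("bert_pre"):
    time.sleep(0.01)  # stand-in for preprocessing
with stage("client_infer"):
    time.sleep(0.02)  # stand-in for the request round trip

for event in events:
    duration_ms = (event["end"] - event["start"]) * 1000
    print(f'{event["name"]}: {duration_ms:.1f} ms')
```

Records of this kind, one per stage with a start and an end time, are what a timeline view lays out along the time axis.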

More links

Official website:
Paddle Serving:
Copyright: This article is a translation of this article.

Published by Irene


One thought on “Paddle Serving: model-as-a-service! Triggered by a single command line, deployment finishes in 10 minutes”

  1. Hi,

    Thanks for this simple article.
    Which Cuda version and cudnn version do you have on your system.
    Because Im getting “Couldnt find” error while launching server

