To bridge the gap between Paddle Serving and the PaddlePaddle framework, we have released a new Paddle Serving service on GitHub: Model as a Service (MaaS). With this service, once a PaddlePaddle model is trained, users can obtain the corresponding inference model at the same time, making it possible to deploy a deep learning inference service online for any application. Paddle Serving has the following four key features:
Easy to use: To reduce development costs for PaddlePaddle users, Paddle Serving provides APIs that connect directly with PaddlePaddle. It integrates seamlessly with the Paddle training pipeline, so most Paddle models can be deployed with a single command.
Industrial-grade service: To meet industrial requirements, Paddle Serving supports large-scale features including distributed sparse parameter indexing, highly concurrent communication, model management, online model loading, and online A/B testing.
Extensible framework design: Paddle Serving currently supports C++, Python, and Golang on the client side, and more client languages will be supported in the future. Although the current version focuses mainly on the PaddlePaddle framework, it is also easy for users to deploy models from other machine learning libraries or frameworks as online inference services.
Support from a high-performance inference engine: The Paddle Inference library is currently the only backend inference engine supported by Paddle Serving. It offers the following high-performance features: memory/GPU memory reuse, automatic operator fusion, and TensorRT and Paddle Lite subgraph calling. Figure 1 shows the process of a client request being sent to the server for computation with Paddle Serving. Throughout the whole process, the underlying communication is handled by the Baidu-RPC service, which provides high concurrency and low latency.
Figure 1: Paddle Serving workflow.
Model as a Service: an inference service launched with a single command line
MaaS, Model as a Service, means that once a model is trained, it can be deployed directly as an online inference service. PaddlePaddle provides an API to save the trained model in a specific format, which can then be deployed with a single command from Paddle Serving. We present a tutorial below.
In this tutorial, we show the usage of Paddle Serving with a house price prediction model. The following commands download the trained model, decompress it, and place the corresponding configuration files in the uci_housing folder.
wget --no-check-certificate https://paddle-serving.bj.bcebos.com/uci_housing.tar.gz
tar -xzf uci_housing.tar.gz
The paddle_serving_server module of Paddle Serving must be installed on the server side. If it is not installed, you can install it with one of the following commands, depending on your machine's hardware configuration:
pip install paddle_serving_server      # install the CPU version of paddle_serving_server
pip install paddle_serving_server_gpu  # install the GPU version of paddle_serving_server
Now it is time to start the inference service!
python -m paddle_serving_server.serve --model uci_housing_model/ --thread 10 --port 9292 --name uci
model: The directory containing the server-side config and model files.
thread: Number of concurrent service threads.
port: Port number.
name: The name of the HTTP inference service. If not specified, an RPC service is started instead.
If you see the following output, then you have started the service successfully.
* Running on http://0.0.0.0:9292/ (Press CTRL+C to quit)
Once the service is running, the service URL in this tutorial becomes http://127.0.0.1:9292/uci/prediction. To request the service, the user sends data to the server over the HTTP protocol in a specific format. After the computation finishes on the server side, the predicted house price is returned to the user.
curl -H "Content-Type:application/json" -X POST -d '{"x": [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332], "fetch":["price"]}' http://127.0.0.1:9292/uci/prediction
Saving trained models: https://github.com/PaddlePaddle/Serving/blob/a4d478d79e120229572bcd56a001688bd7e07b94/doc/SAVE.md
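As a rough illustration of the saving step described in SAVE.md, the sketch below shows how a network defined with the Paddle 1.x fluid API could be saved with serving_io.save_model. The network, variable names, and directory names here are illustrative assumptions, not the exact code from the repository:

import paddle.fluid as fluid
import paddle_serving_client.io as serving_io

# a toy linear model over the 13 UCI housing features
x = fluid.data(name="x", shape=[None, 13], dtype="float32")
price = fluid.layers.fc(input=x, size=1, act=None)

exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())
# ... training loop omitted ...

# save the server-side model directory and the client-side config directory,
# together with the feed (input) and fetch (output) variable dicts
serving_io.save_model("uci_housing_model", "uci_housing_client",
                      {"x": x}, {"price": price},
                      fluid.default_main_program())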
The method described above works for general models that do not involve complicated computation or data preprocessing. For models that do require preprocessing, Paddle Serving offers other easy-to-use methods.
In that case we need a client module to handle preprocessing, so that the data can be prepared for the server side. With only a few additional steps, the whole service can still be deployed in about ten minutes. In the following, we present a tutorial using Bert As A Service as an example.
Building Bert As A Service in 10 minutes
Bert As A Service is a popular project in the GitHub community that provides the semantic embedding of a given input sentence. The Bert model has attracted great attention from NLP researchers due to its strong performance on many public NLP tasks, and using Bert embeddings as the input to an NLP task can improve performance significantly. Now, with the help of Paddle Serving, you only need four steps to deploy the same service!
1. Saving Servable Models
Paddle Serving supports any model trained with PaddlePaddle. By specifying the input and output variables, one can save a trained model in a servable format. For this tutorial, we load the pretrained Chinese Bert model bert_chinese_L-12_H-768_A-12 from PaddleHub and run the following script to save the configuration files for later use. The server and client configuration files are saved in the bert_seq20_model and bert_seq20_client directories respectively.
import paddlehub as hub

model_name = "bert_chinese_L-12_H-768_A-12"
# obtain the model files
module = hub.Module(model_name)
# get the input and output information, as well as the program
inputs, outputs, program = module.context(trainable=True, max_seq_len=20)
# map the names of inputs and outputs
feed_keys = ["input_ids", "position_ids", "segment_ids", "input_mask"]
fetch_keys = ["pooled_output", "sequence_output"]
feed_dict = dict(zip(feed_keys, [inputs[x] for x in feed_keys]))
fetch_dict = dict(zip(fetch_keys, [outputs[x] for x in fetch_keys]))

# save the model and config files for serving
# param1: server model directory, param2: client config file directory,
# param3: input dict, param4: output dict, param5: model program
import paddle_serving_client.io as serving_io
serving_io.save_model("bert_seq20_model", "bert_seq20_client",
                      feed_dict, fetch_dict, program)
2. Start the server
Run the following command to start the server; gpu_ids specifies the GPU ID to use.
python -m paddle_serving_server_gpu.serve --model bert_seq20_model --thread 10 --port 9292 --gpu_ids 0
If the server starts successfully, you will see the following message.
Server[baidu::paddle_serving::predictor::general_model::GeneralModelServiceImpl] is serving on port=9292.
3. Client data preprocessing configuration
Paddle Serving contains preprocessing modules for a variety of public datasets. For the Bert tutorial we use the ChineseBertReader class from paddle_serving_app, which makes it possible to obtain the semantic embedding of a given Chinese sentence. Run the following command to install paddle_serving_app:
pip install paddle_serving_app
4. Configure the client to access the server to obtain inference results
import os
import sys
from paddle_serving_client import Client
from paddle_serving_app import ChineseBertReader

# define the reader for data preprocessing
reader = ChineseBertReader()
# specify the inference results to fetch
fetch = ["pooled_output"]
# specify the server IP address
endpoint_list = ["127.0.0.1:9292"]
# create the client, load the client config file, and connect to the server
client = Client()
client.load_client_config("bert_seq20_client/serving_client_conf.prototxt")
client.connect(endpoint_list)
# read the data, send it to the server, and print out the results
for line in sys.stdin:
    feed_dict = reader.process(line)
    result = client.predict(feed=feed_dict, fetch=fetch)
    print(result)
Now save the script above as bert_client.py, prepare the Chinese sentences in a text file, say data.txt, and run the script to get the inference results.
cat data.txt | python bert_client.py
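For example, a one-sentence data.txt can be created as follows (the sentence is just an illustrative Chinese input):

echo "语义表示是自然语言处理的基础" > data.txt
cat data.txt | python bert_client.py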
More examples are also available at https://github.com/PaddlePaddle/Serving/tree/develop/python/examples/bert
Increasing throughput capacity
As one of the fundamental metrics, throughput is important for the quality of an online service. We tested Bert As A Service powered by Paddle Serving on 4 NVIDIA Tesla V100 GPUs. With the same batch size and number of threads, the comparison is shown in Figure 2. There is a significant increase in throughput when Paddle Serving is used; with a batch size of 128, throughput improves by about 58.3%. A rough client-side sketch of such a test is shown after Figure 2.
Figure 2: Paddle Serving throughput test
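The sketch below shows one way such a throughput test could be scripted on the client side. It is only an illustration under assumptions: it reuses the bert_seq20 service from the previous section, the worker function name and thread count are arbitrary, and the official benchmark scripts in the Serving repository should be used for reproducible numbers.

import time
from threading import Thread
from paddle_serving_client import Client
from paddle_serving_app import ChineseBertReader

THREADS = 10  # number of concurrent client threads (illustrative)

def benchmark_worker(lines, results, idx):
    # each worker holds its own reader and its own connection to the server
    reader = ChineseBertReader()
    client = Client()
    client.load_client_config("bert_seq20_client/serving_client_conf.prototxt")
    client.connect(["127.0.0.1:9292"])
    start = time.time()
    for line in lines:
        client.predict(feed=reader.process(line), fetch=["pooled_output"])
    results[idx] = len(lines) / (time.time() - start)

lines = open("data.txt").readlines()
results = [0.0] * THREADS
threads = [Thread(target=benchmark_worker, args=(lines, results, i))
           for i in range(THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("aggregate throughput: %.1f sentences/s" % sum(results))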
In addition, PaddlePaddle also supports a variety of inference services for other tasks across different domains.
Timeline: a visualization tool for performance analysis
Paddle Serving provides Timeline, a performance visualization tool that can visualize what happens on the backend once the clients are started. As an example, the visualization for Bert As A Service is shown in Figure 3, where bert_pre indicates the client-side preprocessing stage and client_infer covers sending the request and receiving the results. The second row of each process shows the server-side part of the timeline.
Figure 3: Timeline, a visualization tool for model performance
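How the raw timing data is collected depends on the Serving version. As an assumption based on the profiling utilities shipped in the Serving repository, enabling profiling and converting the log into a trace viewable in chrome://tracing may look roughly like the following; the environment variables and the conversion script path should be checked against the repository documentation for your version:

export FLAGS_profile_client=1   # record client-side timestamps
export FLAGS_profile_server=1   # record server-side timestamps
cat data.txt | python bert_client.py 2> profile.log
# convert the raw profile log into a Chrome trace file (script location is an assumption)
python timeline_trace.py profile.log trace.json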
More links
Official website:https://www.paddlepaddle.org.cn
Paddle Serving: https://github.com/PaddlePaddle/Serving/tree/v0.2.0
GitHub: https://github.com/PaddlePaddle/Paddle
Gitee: https://gitee.com/paddlepaddle/Paddle
Copyright: This article is a translation of the original article.