In the summer of 2019, I worked as a high school intern for the ONNX AI team at Microsoft and loved working on various projects with the team, including the BERT text classification model. However, due to Covid-19, the Microsoft Internship Program for high school students was canceled in the summer of 2020. This led two other former interns and me to reach back out to the AI team, landing positions as open source developers for the summer.

Today, I’d like to share our work on two meaningful projects, the RoBERTa text-classification and DeepVoice3 text-to-speech models. I spent the summer converting these models into the ONNX format and contributing them to the ONNX model zoo, a collection of pre-trained, state-of-the-art ONNX models from community members.


RoBERTa is a Natural Language Processing (NLP) model and an optimized version of BERT (Bidirectional Encoder Representations from Transformers). This transformer model is complex, with multiple HEADs and functionalities. For my project, I specifically worked with the RoBERTa-base model with no HEAD and the RoBERTa sentiment analysis model, training the base model with the model weights provided by HuggingFace. Due to RoBERTa’s complex architecture, training and deploying the model can be challenging, so I accelerated the model pipeline using ONNX Runtime.

As you can see in the following chart, ONNX Runtime accelerates inference time across a range of models and configurations.

Bar chart of inference time with ONNX Runtime across models and configurations

For the sentiment analysis model, I trained it with the model weights from the alibi datasets that use the Cornell movie-review dataset. As this model used a different dataset from the one provided by HuggingFace, I faced a lot of issues with training the model. Training was completed over the course of two days, 1239 epochs. Since I didn’t have access to a GPU, I trained using a CPU. I also generated the MCC (Matthews Correlation Coefficient) validation score for the model.

Matthews Correlation Coefficient validation score
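
For readers who have not met MCC before, here is a minimal sketch of how the score can be computed for a binary classifier. It is not the code from my validation notebook; the function name `mcc_binary` and the 0/1 label encoding are assumptions made for illustration. MCC is built from the four confusion-matrix counts and ranges from -1 (every prediction wrong) to +1 (every prediction right).

```python
import math


def mcc_binary(y_true, y_pred):
    """Matthews Correlation Coefficient for 0/1 labels (hypothetical helper)."""
    # Count the four confusion-matrix cells.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Common convention: report 0.0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    labels = [1, 1, 0, 0]       # made-up ground truth
    predictions = [1, 0, 1, 0]  # made-up model output
    print(mcc_binary(labels, predictions))
```

Unlike plain accuracy, MCC uses all four cells of the confusion matrix, which is why it is often preferred when the classes are unbalanced.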

I also had trouble with the post-processing code for this model. After researching and understanding the output produced by the model, I was able to figure out the code. I converted both models into ONNX format. In order to add these models to the zoo, I created a pull request containing the model files, the Jupyter notebooks for inference and validation of the sentiment analysis model, and pre-processing and post-processing code to help the user run the model. Here is the link to the RoBERTa model in the zoo.
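
To give a feel for what post-processing involves, here is a rough sketch and not the code from the pull request. It assumes the sentiment model returns two raw scores (logits) per input and that index 0 means negative and index 1 means positive; both the output shape and the label order are assumptions, and the names `softmax`, `postprocess`, and `LABELS` are hypothetical.

```python
import math

# Assumed label order; check the model's documentation before relying on it.
LABELS = ("negative", "positive")


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


if __name__ == "__main__":
    label, confidence = postprocess([-1.0, 2.0])  # made-up logits
    print(label, round(confidence, 4))
```

Reporting the probability alongside the label lets the person running the model decide how much to trust a borderline prediction.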


DeepVoice3 is a text-to-speech (TTS) model, where the input is a sentence and the output is the audio of that sentence. Currently, the ONNX model zoo doesn’t have any speech and audio processing models, so I started work on DeepVoice3 and aimed to contribute an audio model to the zoo.

However, I faced several issues converting this TTS model to ONNX. The first challenge was the input that the ONNX exporter allowed. The exporter only allowed a three-parameter input that was generated over an array of sentences, but the original PyTorch model required an input of two parameters of features generated over each individual sentence.

To test it, I converted the model using the three-parameter input. Due to the input difference, the converted ONNX model didn’t produce the right audio output. I studied the DeepVoice3 model architecture to understand how the encoder and decoder methods work and went through the test files provided by the DeepVoice3 GitHub.

Text to speech model diagram

During my research into other TTS models that were successfully converted to ONNX, I discovered that there have been issues in the past with converting TTS models to ONNX, specifically because of the decoder method that TTS models use. I wondered if DeepVoice3 was compatible with ONNX and filed an issue in the PyTorch repository regarding the conversion error. I’ve been talking to some engineers to come up with a solution.

However, I still wasn’t sure if the problem was that the model was not compatible with ONNX or that there was an issue within the PyTorch model itself. I talked to some engineers within the ONNX team. After discussing, we agreed that there were several issues within the model, specifically in how the Decoder method was built, that made it difficult to export the right ONNX model.

Adding a text-to-speech model to the ONNX model zoo will be the subject of ongoing exploration.


Being able to add the RoBERTa model to the ONNX model zoo gives users of the zoo more opportunities to use natural language processing (NLP) in their AI applications, with the extra predictive power that RoBERTa provides. Personally, I gained a better understanding of the NLP domain and transferred this knowledge into the materials in the zoo, by way of the notebooks, as well as the pre- and post-processing code. As a new open source developer, I learned a lot about git, pull requests, and the GitHub community review process, which will stand me in good stead for future open source contributions.

I’m grateful to my mentors Vinitra Swamy and Natalie Kershaw, who dedicated their time to weekly meetings and guided me throughout the project.

Questions or feedback about these projects? Let us know in the comments below.











