Paper ID | HLT-16.1 |
Paper Title | Recent Advances in Arabic Syntactic Diacritics Restoration |
Authors | Yasser Hifny, University of Helwan, Egypt |
Session | HLT-16: Applications in Natural Language |
Location | Gather.Town |
Session Time | Thursday, 10 June, 16:30 - 17:15 |
Presentation Time | Thursday, 10 June, 16:30 - 17:15 |
Presentation | Poster |
Topic | Human Language Technology: [HLT-STPA] Segmentation, Tagging, and Parsing |
Abstract | Long Short-Term Memory (LSTM) networks achieve state-of-the-art performance in restoring Arabic syntactic diacritics. These networks are commonly augmented with Maximum Entropy (MaxEnt) sparse direct connections between the input and output layers of the tagger. One way to improve such a tagger is to use an ensemble of taggers, but an ensemble may require substantial computational and memory resources. In this paper, we implement a knowledge distillation technique in which an ensemble of teacher taggers is used to train a single student tagger. In addition, Arabic is a morphologically rich language with a high Out-Of-Vocabulary (OOV) rate. To address this problem, we propose to supplement word embeddings with character embeddings that are encoded by an LSTM for each word. On the Arabic Treebank task, our hybrid LSTM/MaxEnt tagger achieves a 1.0% absolute word error rate (WER) improvement over a strong baseline using the two proposed techniques. |
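
The abstract does not give the distillation objective, so the following is only a minimal sketch of one common formulation: the student is trained against the averaged output distribution of the teacher ensemble, mixed with the usual cross-entropy on the reference tags. The function names, the `temperature` and `alpha` parameters, and the array shapes are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def ensemble_soft_targets(teacher_logits, temperature=1.0):
    """Average the teachers' tag distributions.

    teacher_logits: array of shape (n_teachers, n_tokens, n_tags).
    Returns an array of shape (n_tokens, n_tags).
    """
    return softmax(teacher_logits, temperature).mean(axis=0)


def distillation_loss(student_logits, soft_targets, hard_labels,
                      alpha=0.5, temperature=1.0):
    """Weighted sum of soft-target and hard-label cross-entropy.

    alpha (assumed, not from the paper) weights the teacher term.
    """
    eps = 1e-12  # guards against log(0)
    # Cross-entropy between the ensemble distribution and the student.
    log_p_soft = np.log(softmax(student_logits, temperature) + eps)
    soft_ce = -(soft_targets * log_p_soft).sum(axis=-1).mean()
    # Standard cross-entropy against the reference diacritic tags.
    log_p = np.log(softmax(student_logits) + eps)
    hard_ce = -log_p[np.arange(len(hard_labels)), hard_labels].mean()
    return alpha * soft_ce + (1.0 - alpha) * hard_ce


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teachers = rng.normal(size=(3, 4, 5))   # 3 teachers, 4 tokens, 5 tags
    targets = ensemble_soft_targets(teachers)
    student = rng.normal(size=(4, 5))
    labels = targets.argmax(axis=-1)        # stand-in for reference tags
    print(distillation_loss(student, targets, labels))
```

In this sketch the ensemble is only needed while the soft targets are computed; at inference time the single student tagger runs on its own, which is the resource saving the abstract describes.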