Bidirectional deep architecture for Arabic speech recognition

Citation:

Zerari N, Abdelhamid S, Bouzgou H, Raymond C. Bidirectional deep architecture for Arabic speech recognition. Open Computer Science (De Gruyter) [Internet]. 2019;9:92–102.

Abstract:

Nowadays, real-life constraints necessitate controlling modern machines through human intervention by means of the sensory organs. The voice is one of the human faculties that can control/monitor modern interfaces. In this context, Automatic Speech Recognition is principally used to convert natural voice into computer text as well as to perform an action based on the instructions given by the human. In this paper, we propose a general framework for Arabic speech recognition that uses a Long Short-Term Memory (LSTM) network and a neural network (Multi-Layer Perceptron: MLP) classifier to cope with the non-uniform sequence lengths of the speech utterances issued from both feature extraction techniques: (1) Mel Frequency Cepstral Coefficients (MFCC, static and dynamic features) and (2) Filter Bank (FB) coefficients. The neural architecture can recognize isolated Arabic speech via a classification technique. The proposed system involves, first, extracting pertinent features from the natural speech signal using MFCC (static and dynamic features) and FB. Next, the extracted features are padded to deal with the non-uniformity of the sequence lengths. Then, a deep architecture, a recurrent LSTM or GRU (Gated Recurrent Unit) network, is used to encode the sequence of MFCC/FB features as a fixed-size vector that is fed to a Multi-Layer Perceptron (MLP) network to perform the classification (recognition). The proposed system is assessed on two different databases: the first concerns spoken digit recognition, where a comparison with other related works in the literature is performed, whereas the second contains spoken TV commands. The obtained results show the superiority of the proposed approach.
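
As a rough illustration of the pipeline described in the abstract (padding variable-length MFCC/FB feature sequences, encoding them with a bidirectional LSTM or GRU into a fixed-size vector, and classifying with an MLP), the following Keras sketch shows one possible implementation. It is not the authors' code; the layer sizes, feature dimensionality, padded length, and number of classes are illustrative assumptions.

```python
# Minimal sketch of an encoder-classifier for isolated-word recognition:
# padded MFCC/FB sequences -> bidirectional LSTM/GRU encoder -> MLP classifier.
# All sizes below are assumptions, not values taken from the paper.
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.sequence import pad_sequences

NUM_CLASSES = 10   # e.g. ten spoken digits (assumption)
NUM_COEFFS = 39    # e.g. 13 MFCCs + deltas + delta-deltas (assumption)
MAX_FRAMES = 100   # common padded sequence length (assumption)

def build_model(use_gru: bool = False) -> models.Model:
    """Bidirectional LSTM/GRU encoder followed by an MLP classifier."""
    rnn = layers.GRU if use_gru else layers.LSTM
    inputs = layers.Input(shape=(MAX_FRAMES, NUM_COEFFS))
    # Mask the zero-padded frames so they do not influence the encoding.
    x = layers.Masking(mask_value=0.0)(inputs)
    # The recurrent encoder summarizes the utterance as a fixed-size vector.
    x = layers.Bidirectional(rnn(128))(x)
    # MLP on top of the encoded utterance performs the classification.
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

if __name__ == "__main__":
    # Dummy data standing in for MFCC/FB sequences of varying length.
    utterances = [np.random.rand(np.random.randint(40, MAX_FRAMES), NUM_COEFFS)
                  for _ in range(32)]
    X = pad_sequences(utterances, maxlen=MAX_FRAMES, dtype="float32",
                      padding="post", value=0.0)
    y = np.random.randint(0, NUM_CLASSES, size=len(X))
    model = build_model(use_gru=False)
    model.fit(X, y, epochs=1, batch_size=8)
```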
