This bachelor thesis examines deep neural networks (DNNs) and their use in automatic speech recognition (ASR) systems in order to evaluate the feasibility of implementing such systems in educational applications that help children with reading difficulties. To aid these children, educators need interactive software that uses ASR to provide feedback on a child’s reading attempts. However, most reading applications used in schools employ outdated ASR based on statistical models, which cannot reach the level of accuracy achieved by the DNN-based speech recognition found in commercial applications.
Most ASR systems used in educational reading software employ a Gaussian Mixture Model (GMM) combined with a Hidden Markov Model (HMM) to perform acoustic modeling. One problem with GMMs is that they have difficulty modeling data that lie on or near a nonlinear manifold, which is common for speech data. DNNs are more effective at modeling such data, so the accuracy of ASR systems can be substantially improved by replacing the GMMs with DNNs. Furthermore, because long short-term memory (LSTM) networks can represent temporal structure and large libraries of labelled speech data exist, it is even possible to have DNNs replace the entire ASR pipeline. Research has shown that both of these configurations outperform GMM-HMM ASR given the same amount of training data, with many DNN-based ASR systems even approaching parity with human transcribers. The goal of this paper is to examine published research and academic works to provide a theoretical background for implementing an educational reading application with DNN-based ASR. This application will be realized in the second part of this author’s bachelor thesis.