Understanding Basic Speech Algorithms in Embedded Systems

Source: https://www.embedded.com/print/ (Embedded.com, MindTree Consulting, February 6, 2006)

Speech processing mainly involves compression/decompression, recognition, conditioning, and enhancement algorithms. Signal-processing algorithms depend on system resources such as available memory and clock speed. Because these resources relate directly to system cost, they're often tightly constrained.

Measuring an algorithm's complexity is the first step in analyzing it. This includes counting the clock cycles required and determining the algorithm's processing load, which can vary with the processor employed. The memory requirements, however, don't change with the processor.

Most DSP algorithms work on collections of samples, better known as frames (Fig. 1). This introduces an inevitable delay due to frame collection that’s in addition to the processing delay. Note that the International Telecommunication Union (ITU) standardizes the acceptable delay for an algorithm.

[Fig. 1. Looking at the audio spectrum, basic telephone-quality speech occurs up to 4 kHz. High-quality speech reaches 7 kHz, followed by CD-quality audio.]

An algorithm's processing load is typically represented in millions of clocks per second (MCPS), the number of clock cycles per second the algorithm needs from the core. Assume an algorithm that processes frames of 64 samples at 8 kHz and requires 300,000 clocks to process each frame. The time required to collect one frame is 64/8000, or 8 ms; in 1 second, 125 frames can be collected. To process all the frames, the algorithm consumes 300,000 × 125 = 37,500,000 clocks/s, represented as 37.5 MCPS. Simplified, the MCPS equation is:

MCPS = (clocks required to execute one frame × sampling frequency / frame size) / 1 million
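The calculation above can be sketched directly; the function name and structure are illustrative, not from the original article:

```python
def mcps(clocks_per_frame, sampling_hz, frame_size):
    # MCPS = (clocks per frame * frames per second) / 1 million
    frames_per_second = sampling_hz / frame_size
    return clocks_per_frame * frames_per_second / 1e6

# The article's example: 64-sample frames at 8 kHz, 300,000 clocks per frame
print(mcps(300_000, 8000, 64))  # 37.5
```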

Note that there's another common term used for measuring an algorithm's processing load: MIPS (million instructions/s). Calculating MIPS for an algorithm can be tricky. If the processor effectively executes one instruction per cycle, the MCPS and MIPS ratings for that processor are the same. Analog Devices' Blackfin is one such processor. Otherwise, if the processor takes more than one cycle to execute an instruction, a ratio exists between the MCPS and MIPS ratings. For example, the ARM7TDMI processor effectively requires 1.9 cycles per instruction.
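The MCPS-to-MIPS ratio described above can be expressed as a one-line conversion (a sketch; the cycles-per-instruction figures are the ones quoted in the text):

```python
def mips_from_mcps(mcps_value, cycles_per_instruction):
    # Instructions per second = clocks per second / average cycles per instruction
    return mcps_value / cycles_per_instruction

print(mips_from_mcps(37.5, 1.0))   # 37.5: a one-cycle-per-instruction core (e.g. Blackfin)
print(mips_from_mcps(37.5, 1.9))   # roughly 19.7 on a core averaging 1.9 cycles/instruction
```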

The memory considerations for any algorithm are typically separated between code (read-only) and data (read-write) memory. The proper memory amount can be found by compiling the source code. Note that algorithms perform at their best when using the fastest memory, and this is usually memory that’s internal to the core.

Integrating an algorithm into an existing system is somewhat easier. If the system is in the development phase, it's recommended to test the audio front end thoroughly before integrating or evaluating any algorithms. Within the system, you must verify that no interrupts conflict with each other. If such an issue exists, debugging can be a painful experience.

In a system that will incorporate audio/speech algorithms, robust audio firmware is a must. It must leave the maximum time, and deliver accurate data, for the algorithms to perform efficiently. One common mistake is to interrupt the core upon each sample's arrival. If the algorithm operates only on frames of a fixed number of samples, the per-sample interrupts are redundant. DMA controllers and internal FIFOs can collect samples and interrupt the core only after a full frame has been collected.
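The saving is easy to quantify. Assuming the article's 8-kHz, 64-sample-frame example, per-frame interrupts cut the interrupt rate by the frame size:

```python
# Per-sample vs. per-frame (DMA/FIFO) interrupt rates, assuming
# 8-kHz sampling and 64-sample frames as in the MCPS example
SAMPLING_HZ = 8000
FRAME_SIZE = 64

per_sample_irq_rate = SAMPLING_HZ                # ISR fires 8000 times/s
per_frame_irq_rate = SAMPLING_HZ // FRAME_SIZE   # DMA fires only 125 times/s
print(per_sample_irq_rate, per_frame_irq_rate)   # 8000 125
```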

Checking signal levels, adjusting the hardware codec gains, synchronizing the near- and far-end interrupts, verifying the DMA function, or any other experiment can be accomplished using this basic telephony standard. During this process, don't be surprised to find that the received compressed data arrives in bit-reversed order; a simple bit-reversing routine will restore it to the expected state. Any wideband speech codec could serve as an example of a speech algorithm that's heavy in terms of memory and clock consumption. One example is sub-band ADPCM (adaptive differential PCM), or G.722, which operates on data sampled at 16 kHz and thus covers the entire speech spectrum. It retains the unvoiced frequency components between 4 and 7 kHz that provide high-quality, natural speech.
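The "bit-reversing" fix mentioned above can be sketched for a 16-bit word (a minimal illustration; real firmware would typically use a lookup table or a hardware bit-reverse instruction):

```python
def bit_reverse16(word):
    # Reverse the bit order of a 16-bit word: bit 0 -> bit 15, bit 1 -> bit 14, ...
    result = 0
    for _ in range(16):
        result = (result << 1) | (word & 1)
        word >>= 1
    return result

print(hex(bit_reverse16(0x0001)))  # 0x8000
```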

Before any codec is integrated into a system, I recommend that the designer do careful testing. While G.711 encoding and decoding can be tested on a sample-by-sample basis, codecs that involve filters and other frequency-domain algorithms are tested differently, using a stream of at least a few thousand samples. Codec verification engages the engineer in unit testing with ITU vectors, signal-level testing, and interoperability testing with other available codecs. Interoperability issues related to arranging the encoded data into 16-bit words before transmission, and to mismatched signal levels, aren't new to system-integration engineers.

Algorithms that require lots of memory and clock cycles have a far bigger impact on the system than those discussed so far. The more compute-intensive algorithms include echo cancellers, noise suppressors, and Viterbi algorithms. Evaluating their performance is not as easy as it is for the speech codecs.

Generally, any telecom system that involves a hands-free or speaker mode employs an acoustic echo canceller. This prevents the far-end party from hearing their own voice as an echo. If operated in a noisy environment, a noise-control algorithm may also be needed. The echo canceller-noise reducer (EC-NR) demands lots of memory and clocks from the system. Both time- and frequency-domain techniques can solve the acoustic echo problem, with frequency-domain techniques proven more efficient at lower computational cost (Table 1).

A frequency-domain technique uses an adaptive FIR filter that updates its coefficients only when the residual echo error exceeds a threshold. Subtracting the estimated echo from the input (near-end) signal gives the error. The far-end signal serves as the reference from which these algorithms estimate the echo; providing a proper reference is essential for good echo estimation and cancellation.
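The estimate-subtract-update loop can be sketched as follows. Note this is a time-domain NLMS update, shown for clarity rather than the frequency-domain variant the text describes, and every name and parameter here is illustrative:

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, filter_len=64, mu=0.5, eps=1e-8):
    """Time-domain NLMS adaptive FIR echo canceller (illustrative sketch)."""
    w = np.zeros(filter_len)                 # adaptive filter coefficients
    residual = np.zeros(len(mic))
    for n in range(filter_len, len(mic)):
        x = far_end[n - filter_len:n][::-1]  # most recent far-end (reference) samples
        echo_estimate = w @ x                # filter models the echo path
        e = mic[n] - echo_estimate           # error = near-end speech + residual echo
        w += mu * e * x / (x @ x + eps)      # normalized coefficient update
        residual[n] = e
    return residual
```

With a clean far-end reference and a linear echo path, the residual echo drops rapidly as the filter converges; a poor reference (the point made above) leaves the echo largely uncancelled.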

Another factor, echo tail length, is the echo reverberation time measured in milliseconds. Simply put, it’s the time spent in echo formation. The filter length is found by multiplying the echo tail length by the sampling frequency (Table 2).

[Table 2]
One of the basic requisites for an echo canceller (EC) implementation is to support data sampled at up to at least 16 kHz, ensuring that wideband speech is covered. Integrating an EC with wideband speech codecs requires some attention. Because the filter length for a given echo tail depends on the sampling frequency, an EC sized to cancel echo up to 72 ms at 8 kHz effectively covers only half that span when applied to 16-kHz sampled data. And compared to 8 kHz, collecting a frame of the same size takes only half the time. Hence, engineers find integrating a half-effective EC with wideband codecs doubly challenging. Designers often raise the core frequency to manage the EC efficiently on a system with a 16-kHz sampling rate.
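The tail-length arithmetic behind Table 2 and the halving effect described above can be checked in a few lines (the function name is illustrative):

```python
def filter_taps(tail_ms, sampling_hz):
    # Filter length = echo tail length (in seconds) * sampling frequency
    return tail_ms * sampling_hz // 1000

print(filter_taps(72, 8000))     # 576 taps cancel a 72-ms tail at 8 kHz
print(576 / 16000 * 1000)        # the same 576 taps span only 36 ms at 16 kHz
```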

Noise-reduction techniques have been used for many years. Depending on the application, an approach is chosen, implemented, and applied. For example, a technique could treat noise as more stationary than human speech: the algorithm models the noise and then subtracts it from the input signal. A reduction of 10 to 30 dB is significant for some applications. A common application for combined EC and noise reduction is a handset placed in speaker mode in a noisy environment, or hands-free mode enabled in a car (Fig. 2).
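The model-and-subtract idea can be sketched as single-frame magnitude spectral subtraction, one classic way to exploit noise stationarity. This is a minimal illustration, not the article's specific algorithm, and the function and its parameters are assumptions:

```python
import numpy as np

def spectral_subtract(frame, noise_mag, floor=0.05):
    # Subtract an estimated stationary-noise magnitude spectrum from one frame,
    # keeping the noisy phase and applying a spectral floor to limit musical noise
    spec = np.fft.rfft(frame)
    mag = np.abs(spec)
    cleaned = np.maximum(mag - noise_mag, floor * mag)
    return np.fft.irfft(cleaned * np.exp(1j * np.angle(spec)), n=len(frame))
```

In practice `noise_mag` would be estimated by averaging magnitude spectra over frames detected as noise-only, and the floor trades residual noise against speech distortion, echoing the quality trade-off discussed below.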

The EC tail length requirement for the hands-free application is about 50 ms and the NR level required can vary from 12 to 25 dB, depending on the noise attributes and expected voice quality. Generally, the higher the noise reduction, the more the speech quality is put at risk. Hence, a level can be selected dynamically to give a reasonable reduction while still maintaining the proper voice quality.

The EC noise reduction can require up to 15 or 20 Kbytes of system memory. Processing each 64-sample frame can consume from 1.5 to 3.0 Mclocks, depending on the processor. Evaluating the performance of this combination can be tricky. The steps include tuning the hardware codec gains; finding the correct microphone and speaker placement; synchronizing the far- and near-end speech and interrupts; ensuring the audio hardware has linear attributes; and testing various EC tail lengths and noise-reduction levels to achieve the best echo cancellation and noise reduction.

It’s important to consider worst cases when evaluating the complexity of any algorithm. An algorithm’s execution time can vary for different frames. This data dependency is due to the fact that a processor might take more time to multiply two samples of higher amplitude than multiplying samples of lesser amplitude.

Adaptive algorithms can mislead you if you observe the cycles consumed over only a few frames in which the filter coefficients happen not to be updated. Adapting the filter coefficients can take several thousand extra cycles, which must be accounted for. A word of caution: don't rely on a single measurement. Experimenting with a variety of vectors will help increase the accuracy of MCPS and performance measurements.
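Putting the worst-case advice into practice means sizing the clock budget from the most expensive frame observed, not the average. A sketch with hypothetical per-frame cycle counts:

```python
def worst_case_mcps(per_frame_cycles, sampling_hz, frame_size):
    # Budget for the worst frame observed, not the average one
    return max(per_frame_cycles) * (sampling_hz / frame_size) / 1e6

# Hypothetical measurements: frames where the filter adapts cost far more
measured = [210_000, 215_000, 480_000, 212_000]
print(worst_case_mcps(measured, 8000, 64))   # 60.0, sized from the 480k-cycle frame
```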
