def match_command(
    self, filename: str, pattern: str, language_code: str
) -> Tuple[Optional[str], Optional[int]]:
    info = sf.info(filename)
    client = StreamingClient(info.samplerate, language_code)
    # Stream the file and inspect every transcript the recognizer returns.
    for transcript in client.recognize(
        self.generate_request_stream(filename, info.samplerate)
    ):
        match = re.match(pattern, transcript)
        if match:
            # Stop the request generator, then report the latency.
            self.closed = True
            return transcript, current_time_ms() - self.command_end_ts
    # No transcript matched the pattern.
    return None, None
pre_silence = np.random.choice(
    [*range(chunk_size, samples_per_ms * 2000 + 1, chunk_size)], 1
)[0]
pre_silence_sent = 0
while pre_silence_sent <= pre_silence:
    yield np.zeros(chunk_size, dtype="int16").tobytes()
    pre_silence_sent += chunk_size
    time.sleep(0.1)
time.sleep every time. I'm using the excellent SoundFile library to open the file, strip the WAVE headers and read the samples. I'm also capturing a timestamp just before each successive chunk is sent; we will need the last of these timestamps to calculate latency at the end.

with sf.SoundFile(filename, mode="r") as wav:
    while wav.tell() < wav.frames:
        sound_for_recognition = wav.read(chunk_size, dtype="int16")
        self.command_end_ts = current_time_ms()
        yield sound_for_recognition.tobytes()
        time.sleep(0.1)
100 - last_chunk_duration milliseconds ago. Let's account for that:

with sf.SoundFile(filename, mode="r") as wav:
    while wav.tell() < wav.frames:
        sound_for_recognition = wav.read(chunk_size, dtype="int16")
        if sound_for_recognition.shape[0] < chunk_size:
            last_chunk_size = sound_for_recognition.shape[0]
            sound_for_recognition = np.concatenate(
                (
                    sound_for_recognition,
                    np.zeros(
                        chunk_size - sound_for_recognition.shape[0],
                        dtype="int16",
                    ),
                )
            )
            self.command_end_ts = (
                current_time_ms()
                - (chunk_size - last_chunk_size) / samples_per_ms
            )
        else:
            self.command_end_ts = current_time_ms()
        yield sound_for_recognition.tobytes()
        time.sleep(0.1)
while not self.closed:
    yield np.zeros(chunk_size, dtype="int16").tobytes()
    time.sleep(0.1)
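To make the correction for the padded final chunk concrete, here is a small worked example of the same arithmetic. The helper name speech_end_offset_ms and the 16 kHz sample rate are assumptions for illustration only; they are not part of the original code.

```python
def speech_end_offset_ms(
    chunk_size: int, last_chunk_size: int, samples_per_ms: float
) -> float:
    """Milliseconds of zero padding appended to the final chunk.

    Hypothetical helper: it mirrors the expression
    (chunk_size - last_chunk_size) / samples_per_ms used above.
    """
    return (chunk_size - last_chunk_size) / samples_per_ms


# Assumed numbers: 16 kHz audio gives 16 samples per millisecond,
# so a 100 ms chunk holds 1600 samples.
samples_per_ms = 16000 / 1000
chunk_size = int(samples_per_ms * 100)

# The file ends with a short chunk of 1000 samples (62.5 ms of audio).
offset = speech_end_offset_ms(chunk_size, 1000, samples_per_ms)
print(offset)  # 37.5
```

With these numbers the last chunk carries 62.5 ms of real audio, so the command ended 100 - 62.5 = 37.5 ms before the timestamp taken for that chunk.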
match_command function: it contains a listening loop for results. As soon as a matching transcript arrives, we capture that timestamp and subtract the timestamp of when the command pronunciation ended. The difference is our latency.
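The latency calculation can be tried without a recognizer by feeding the loop a plain list of transcripts. This is only a sketch: first_match_latency is a hypothetical stand-alone version of the loop, and the body of current_time_ms is an assumption (the original helper is not shown), implemented here with a monotonic clock.

```python
import re
import time
from typing import Iterable, Optional, Tuple


def current_time_ms() -> float:
    # Assumed implementation; a monotonic clock suits interval measurement.
    return time.monotonic() * 1000


def first_match_latency(
    transcripts: Iterable[str], pattern: str, command_end_ts: float
) -> Tuple[Optional[str], Optional[float]]:
    """Return the first transcript matching pattern and its latency in ms."""
    for transcript in transcripts:
        if re.match(pattern, transcript):
            return transcript, current_time_ms() - command_end_ts
    # Nothing matched.
    return None, None


# Pretend the command ended now, and results arrive a little later.
end_ts = current_time_ms()
time.sleep(0.05)
text, latency = first_match_latency(
    ["turn", "turn on the light"], r"turn on", end_ts
)
print(text, latency)
```

Note that re.match only matches at the beginning of the string, so the pattern has to describe the transcript from its first character; re.search would be the choice for matching anywhere in the transcript.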