Implemented as a REST API server that can be deployed on a Linux machine with a GPU (it can also run on a CPU). It can process almost any audio or video format, converting the audio track into a text transcription. It has the following abilities:

- Text transcription of the submitted audio with high accuracy on our development set (90-98%), which corresponds to the state of the art on the market.
- A single GPU node can process about 20 audio streams simultaneously.
- The server software can be easily modified to meet client needs.

Benefits

The model has the following benefits:

- Accent-agnostic design with a single universal model: there is no need to switch models depending on the origin of the audio stream, as in competing solutions, which may not be versatile enough for production deployments.
- Channel-quality-agnostic design: 8 kHz telephone and 16+ kHz broadcast processing are integrated into the single model without loss of quality, which increases versatility further.
- Can run in a slower mode (about 2.5x slower for 4 additional hypotheses) with increased accuracy and additional versions of the decoded text. These alternatives can be used for human-driven correction of the transcript, leading to almost 100% top-5 accuracy in most cases.
- Noise resilient: no noise-removal pre-pass is needed for moderate amounts of noise.
- Can be easily extended to unseen cases just by adding data to the training pipeline, free of charge.

Voice activity detection

Built-in, noise-resilient voice activity detection (VAD) based on a fast neural network. It can differentiate between speech and non-speech even in highly noisy environments, including background music, crowds and so on. It can be exposed as a separate API for special purposes.

Speaker identification

On 0.3-second frames, the implementation distinguishes between the same and a different speaker with 90% accuracy for broadcast-quality signal. Speaker vectors are stable between runs and can be used to match a voice against a database of known speakers. Transcribed text is automatically split between different speakers in the API response.
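As an illustration of matching a speaker vector against a database of known speakers, here is a minimal sketch. The product description does not specify the vector format or the comparison metric, so everything here is an assumption: cosine similarity as the metric, the 0.75 threshold, the function names `cosine_similarity` and `match_speaker`, and the three-dimensional toy vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector carries no speaker information
    return dot / (norm_a * norm_b)


def match_speaker(vector, known_speakers, threshold=0.75):
    """Return (name, score) of the closest known speaker.

    The name is None when no speaker reaches the threshold.
    The threshold value is illustrative, not taken from the product.
    """
    best_name, best_score = None, -1.0
    for name, reference in known_speakers.items():
        score = cosine_similarity(vector, reference)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


# Toy database of known speakers (hypothetical vectors).
known = {
    "alice": [0.9, 0.1, 0.0],
    "bob": [0.0, 0.2, 0.95],
}

name, score = match_speaker([0.85, 0.15, 0.05], known)
print(name, round(score, 3))
```

Because the vectors are described as stable between runs, reference vectors could in principle be computed once per known speaker and reused. In practice the threshold would need tuning on real data.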