
Open tools and data for cloudless automatic speech recognition

Automatic Speech Recognition has made big leaps forward in recent years thanks to advances in Deep Learning. On specific tasks it is reaching human parity. The algorithms that make this possible are published by researchers, and the Deep Learning frameworks of the big tech companies are open source.

Our mission is to make it as easy as possible to take advantage of these developments in a way that is tailored to your business needs.

Keep Ownership of your Data

There are numerous benefits to having a custom Automatic Speech Recognition System that is tailored to your needs.

No Cloud

For one, no (audio) data ever has to leave your company or device - data that may contain sensitive personal or corporate information.

Domain Specific Models

Furthermore, an Automatic Speech Recognition System trained for your use case will probably perform better than a general-purpose system. Sometimes it is enough to train a domain-specific language model, which can be trained in an unsupervised manner on existing texts. Sometimes it is also necessary to train an acoustic model, which requires a speech corpus consisting of audio data together with manually created transcriptions.
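To illustrate why a language model needs nothing but text, here is a minimal sketch of a bigram model with add-one smoothing. It is not the toolchain used by this project; the function names and the example sentences are hypothetical, and real systems use far larger corpora and better smoothing.

```python
from collections import Counter


def train_bigram_counts(sentences):
    """Count unigrams and bigrams from plain text (no audio required)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        # Sentence boundary markers let the model learn typical starts/ends.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """Estimate P(word | prev) with add-one smoothing over the seen vocabulary."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


# Hypothetical in-domain texts, e.g. typical smart home commands.
domain_texts = [
    "turn on the kitchen light",
    "turn off the kitchen light",
    "turn on the radio",
]

unigrams, bigrams = train_bigram_counts(domain_texts)
p_on = bigram_prob("turn", "on", unigrams, bigrams)
p_radio = bigram_prob("turn", "radio", unigrams, bigrams)
print(f"P(on | turn) = {p_on:.3f}, P(radio | turn) = {p_radio:.3f}")
```

Word sequences that are common in the domain texts receive a higher probability, which is what steers the recognizer towards domain-typical phrases.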

Stay in Control of your Model

You have complete control over what data goes into the training of your model, and how it should behave.

Fine-tuned models for your environment

Not only the acoustic and language models can be tailored to your needs - the underlying model can also be adapted to existing constraints on computing power. Clearly, a full-fledged server has more computing resources than a small computer such as a Raspberry Pi.

Cheaper than Cloud Services

Most cloud-based Speech To Text systems charge per time unit of transcribed audio. Training a custom Speech To Text system, on the other hand, requires an upfront investment. In the long run, however, a domain-specific system can turn out to be cheaper.
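The trade-off can be made concrete with a simple break-even calculation. The sketch below is only an illustration: the function name and all prices are made-up placeholders, not actual quotes from any provider or from this project.

```python
def break_even_hours(upfront_cost, cloud_price_per_hour, own_cost_per_hour=0.0):
    """Hours of transcribed audio after which a custom system is cheaper.

    Returns None if running your own system costs as much or more per hour
    than the cloud service, in which case there is no break-even point.
    """
    saving_per_hour = cloud_price_per_hour - own_cost_per_hour
    if saving_per_hour <= 0:
        return None
    return upfront_cost / saving_per_hour


# Hypothetical numbers: 10,000 upfront for training and setup, 1.50 per
# audio hour in the cloud, 0.25 per audio hour for your own hardware and power.
hours = break_even_hours(
    upfront_cost=10000.0,
    cloud_price_per_hour=1.5,
    own_cost_per_hour=0.25,
)
print(f"Break-even after {hours:.0f} hours of transcribed audio")
```

With these placeholder figures the custom system pays off after 8000 hours of audio; the higher your transcription volume, the sooner that point is reached.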

Use Cases

The use cases for Speech To Text technology can be broadly divided into command and control tasks and large vocabulary transcription tasks.

Command & Control






Assistive Tech for the Elderly or Persons with Disabilities

Large Vocabulary Transcription


Generation of Subtitles


Transcription of Audio Archives




Transcription of Telephone Calls


Medical Documentation


Transcription of Meetings

What we have to offer


Customized Models

Choose among different speech and text corpora to create models tailored to your use case, e.g. for phone calls, distant microphone recordings, or television news.


Pronunciation Lexica

We have created and are expanding a German pronunciation lexicon containing more than 370k words and their phonetic representations. An English lexicon based on CMU Dict is also available.
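A pronunciation lexicon maps each word to one or more phoneme sequences. As a rough sketch, the following reads entries in the plain-text CMU Dict style (word, whitespace, phonemes; alternative pronunciations marked with a numeric suffix such as `(1)`; comment lines starting with `;;;`). The function name and sample lines are illustrative, and the file format of the lexica shipped with this project may differ.

```python
import re
from collections import defaultdict


def parse_cmudict_lines(lines):
    """Parse CMU Dict style lines into {word: [phoneme list, ...]}."""
    lexicon = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip empty lines and comment lines.
        if not line or line.startswith(";;;"):
            continue
        word, *phones = line.split()
        # "READ(1)" is an alternative pronunciation of "READ".
        word = re.sub(r"\(\d+\)$", "", word)
        lexicon[word].append(phones)
    return dict(lexicon)


# Illustrative sample in CMU Dict style.
sample = [
    ";;; comment line",
    "SPEECH  S P IY1 CH",
    "READ  R EH1 D",
    "READ(1)  R IY1 D",
]

lexicon = parse_cmudict_lines(sample)
for word, pronunciations in lexicon.items():
    print(word, pronunciations)
```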


Pre-trained ASR Models

Directly use pre-trained Kaldi or CMU Sphinx models for your own ASR projects - available here for English and German.


Grapheme to Phoneme Models

Use our scripts and pronunciation lexica to train grapheme to phoneme models or download and use a pre-trained model.


Supported Speech Corpora

The scripts form a processing pipeline that supports several speech corpora out of the box: VoxForge (English and German), German Speechdata Package V2, LibriSpeech, Forschergeist (German), and Zamia (German).


Grow Open Speech Corpora

Prepare LibriVox data for training acoustic models. Help grow open speech corpora by efficiently correcting existing transcripts of spontaneous speech, for example from podcasts.


Original Speech Corpora

This project encompasses two original German speech corpora: Forschergeist and Zamia. The former contains spontaneous speech, the latter audio recordings of interactions between humans and smart speakers.


Embedded Speech Recognition

Use our pre-trained models for CMU Sphinx and Kaldi to do speech recognition on resource-constrained systems, e.g. a Raspberry Pi.


Remix Speech Corpora

Create noisy versions of training data or re-encode audio data to simulate different environments, e.g. noisy background or different telephone codecs.
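One common way to create a noisy version of a recording is to add background noise at a chosen signal-to-noise ratio (SNR). The following is a minimal sketch of that idea in plain Python, not the project's actual scripts: `mix_at_snr` is a hypothetical helper, and the sine tone and random noise merely stand in for real speech and noise recordings.

```python
import math
import random


def rms(samples):
    """Root mean square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech/noise reaches snr_db, then add it."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    # SNR in dB is 20 * log10(rms_speech / rms_noise).
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / rms(noise)
    return [s + gain * n for s, n in zip(speech, noise)]


# Stand-in signals: one second at 16 kHz of a 440 Hz tone and uniform noise.
random.seed(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [random.uniform(-1.0, 1.0) for _ in range(16000)]

noisy = mix_at_snr(speech, noise, snr_db=10)
```

Lower `snr_db` values produce louder background noise, so generating several versions at different SNRs exposes the acoustic model to a range of conditions.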