I assume you mean a toolkit, if you are interested in training or developing models.
It is hard to answer such a generic question without more context, but it is not an easy implementation. Feature extraction is only a small and fairly simple part of ASR. Since some toolkits are open source, you could start from their source code. Developing a new toolkit from scratch may take some time.
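To give a sense of how small the feature extraction part is, here is a minimal sketch of log-mel filterbank features using only NumPy. It is an illustration, not how any particular toolkit does it: the function names, the 16 kHz sample rate, the 25 ms / 10 ms framing, and the 40 mel bins are all assumptions I picked for the example, and real toolkits add steps such as pre-emphasis, dithering, and normalization.

```python
import numpy as np

def hz_to_mel(hz):
    # Common HTK-style mel scale formula.
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    # Triangular filters whose edges are evenly spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fbank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fbank

def log_mel_features(signal, sample_rate=16000, frame_ms=25, hop_ms=10,
                     n_fft=512, n_mels=40):
    # Hypothetical helper: mono float signal in, (n_frames, n_mels) array out.
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # Build an index matrix so each row selects one overlapping frame.
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # Power spectrum of each frame (rfft zero-pads frames up to n_fft).
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    # Small constant avoids log(0) on silent frames.
    return np.log(mel_energy + 1e-10)

if __name__ == "__main__":
    t = np.arange(16000) / 16000.0
    tone = np.sin(2.0 * np.pi * 440.0 * t)  # one second of a 440 Hz tone
    print(log_mel_features(tone).shape)
```

With these assumed settings, one second of 16 kHz audio gives a 98 x 40 feature matrix. Everything after this step (acoustic model, language model, decoder, training recipes) is where the real work in a toolkit lies.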
If you just want to transcribe audio, there are other libraries available. You can Google them easily. The competitive landscape changes, so I usually don't comment on it, because whatever I say can become obsolete quickly.