Making a Guitar Tuner with HTML5

2014-11-01 math physics music guitar programming

The guitar tuner described in this article can be accessed at https://jbergknoff.github.io/guitar-tuner. The source code is available on GitHub.

Nowadays, you can do all sorts of cool things with audio in web browsers. Interacting with the microphone or playing arbitrary sound used to require Flash, but the Web Audio API has made major strides. The API is still under heavy development and is sure to improve with time.

Adding sound to my web-based NES emulator is perhaps a more interesting topic for an article, but I’ll write on that another time. A simple guitar tuner is well-suited to a short, explicative article.

Microphone data

It’s very easy to get data from the microphone in a modern web browser:

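// Ask the user for microphone access. On success, the browser calls use_stream
// with the microphone stream; errors are silently ignored in this minimal example.
navigator.getUserMedia({ "audio": true }, use_stream, function() {});

function use_stream(stream)
{
	var audio_context = new AudioContext();
	// Wrap the microphone stream in a source node belonging to this context.
	var microphone = audio_context.createMediaStreamSource(stream);
	// do something with the microphone stream...
}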

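We’ll need to improve on this, but the basic idea is encapsulated here. navigator.getUserMedia puts up a browser prompt, asking the user to give access to the microphone (and/or webcam), and createMediaStreamSource wraps the resulting stream in an audio source node. That node can then be connected to all sorts of other audio nodes to process the sound and, if desired, play it over the speakers. (Current browsers deprecate this callback form in favor of the promise-based navigator.mediaDevices.getUserMedia, which takes the same constraints object.)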

The idea for a guitar tuner is straightforward: we want to listen to the user playing a guitar, and then say which note is being played. Really, I’d like it to be slightly better. There’s a lot of no-man’s-land between the notes in a chromatic scale, so I would prefer if the tuner could say “you’re near this note”, and indicate the direction (up or down) to tune the instrument correctly.
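To make the “you’re near this note” idea concrete, here is a minimal sketch that maps a frequency to the nearest note and a direction to tune. It assumes equal temperament with concert A at 440 Hz. The function name describe_pitch and the fields it returns are made up for illustration and are not taken from the tuner's source.

```javascript
// Note names for the twelve semitones, starting from A (the reference note).
var NOTE_NAMES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"];

// Assumption: equal temperament, with concert A (440 Hz) as the default reference.
function describe_pitch(frequency, reference)
{
	reference = reference || 440;
	// Fractional number of semitones between the input and the reference.
	var semitones = 12 * Math.log(frequency / reference) / Math.LN2;
	var nearest = Math.round(semitones);
	// Offset from the nearest note in cents (hundredths of a semitone).
	var cents = Math.round(100 * (semitones - nearest));
	// The double modulo keeps the index non-negative for notes below A.
	var name = NOTE_NAMES[((nearest % 12) + 12) % 12];
	// A flat string (negative cents) must be tightened, a sharp one loosened.
	var advice = cents === 0 ? "in tune" : (cents < 0 ? "tune up" : "tune down");
	return { "note": name, "cents": cents, "advice": advice };
}
```

For example, describe_pitch(81) reports an E that is flat, so the advice is to tune up. Cents are a convenient unit here because the same offset means the same musical distance at any pitch.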

Extracting frequency information

In order to tell what frequency is being played, we will want to take something like a Fourier transform of the microphone data. There is a built-in AnalyserNode which can sort of do this for us, but I will avoid the canned routine for a couple of reasons. First, it is inflexible and too limited for this application (I will explain in the next section). Second, we have to do something of interest; using other people’s implementations can be practical and expedient, but there is usually little to be learned from it.

The waveform (i.e. the audio signal) is some complicated thing. We expect it to have one major pitch/frequency (because the user is plucking a guitar string), but it will always have additional stuff, like overtones, the hum of your ceiling fan, etc. The idea behind a Fourier decomposition is to break the complicated waveform into a sum of simple waveforms (sinusoids), each representing only one frequency. This involves computing a bunch of quantities that look like

$\sum_{t} ϕ (t) \sin (2 π f t),$

where $ϕ (t)$ represents sound amplitude as a function of time, and the sinusoid is a wave with some frequency $f$.

What is the significance of this quantity? It gives a measure of how much $ϕ (t)$ looks like a signal with frequency $f$. Consider that if $ϕ$ rises and falls in exactly the same way as the sinusoid, then the terms of the sum are a bunch of positive things ($ϕ$ and $\sin$ go positive together, and go negative together, so their product is always positive). In other words, if $ϕ$ looks a lot like a signal with frequency $f$, then the thing computed above is big in magnitude. On the other hand, if $ϕ$ is radically different, then the terms of the sum will be a lot of random junk that will tend to cancel out to zero (rather than pile up as all positive or all negative).

So if we want to know how much our microphone data resembles concert A (440 Hz), we can compute a correlation

$\text{similarity to concert A} = \sum_{t} ϕ (t) \sin (2 π \cdot 440 \text{ Hz} \cdot t) .$
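In code, this correlation is just a loop over the samples. The following is a minimal sketch, not the tuner's actual source: the function name correlate_with_sine, the 44100 samples/second rate, and the synthetic test signal are assumptions made for illustration.

```javascript
// Correlate a block of samples with a sine wave of the given frequency.
// `samples` holds amplitudes; `sample_rate` is in samples per second.
function correlate_with_sine(samples, sample_rate, frequency)
{
	var sum = 0;
	for (var i = 0; i < samples.length; i++)
	{
		var t = i / sample_rate; // time of sample i, in seconds
		sum += samples[i] * Math.sin(2 * Math.PI * frequency * t);
	}
	return sum;
}

// Hypothetical test signal: 0.1 seconds of a pure 440 Hz sine.
var sample_rate = 44100;
var test_signal = [];
for (var i = 0; i < 4410; i++)
	test_signal.push(Math.sin(2 * Math.PI * 440 * i / sample_rate));

// Large: the signal rises and falls together with the 440 Hz sine.
var at_440 = correlate_with_sine(test_signal, sample_rate, 440);
// Small by comparison: the products mostly cancel against a 523.25 Hz sine.
var at_523 = correlate_with_sine(test_signal, sample_rate, 523.25);
```

Comparing at_440 with at_523 shows the piling-up versus cancelling behavior described above.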

A textbook Fourier transform will compute these correlations for a set of evenly spaced frequencies: 10 Hz, 20 Hz, 30 Hz, and so on. The notes on a chromatic scale, however, are distributed geometrically (according to $f_{n} = f_{0} \cdot 2^{n / 12}$, where $f_{0}$ is some reference frequency, e.g. 440 Hz concert A, and $n$ is an integer number of semitones above or below the reference note). A Fourier transform would test too few frequencies in some places (around 100 Hz, where adjacent notes are only about 6 Hz apart) and too many frequencies in other places (around 300 Hz, where adjacent notes are about 18 Hz apart).

It is useful, therefore, to eschew the textbook Fourier transform and just sample a set of frequencies that will allow us to look for (1) the notes themselves, and (2) “a little too high” and “a little too low” for each note. It won’t be possible to reconstruct the original signal from the stuff we compute (i.e. this is lossy, while a real Fourier transform is not), but that doesn’t matter in this context.
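One way to build such a set of test frequencies is sketched below, using $f_{n} = f_{0} \cdot 2^{n / 12}$ with a 440 Hz reference. The offset of one third of a semitone for “a little too low” and “a little too high” is an assumption chosen for illustration, as is the name build_test_frequencies. The article does not say which offsets the tuner uses.

```javascript
// Build the list of frequencies to test. Each note contributes three entries:
// slightly flat, exact, and slightly sharp.
// Assumptions: 440 Hz default reference, 1/3 semitone offset for flat/sharp.
function build_test_frequencies(lowest, highest, reference)
{
	reference = reference || 440;
	var offset = 1 / 3;
	var frequencies = [];
	for (var n = lowest; n <= highest; n++)
	{
		[-offset, 0, offset].forEach(function(detune)
		{
			frequencies.push({
				"semitone": n,       // semitones above (or below) the reference
				"detune": detune,    // fraction of a semitone off the exact note
				"frequency": reference * Math.pow(2, (n + detune) / 12)
			});
		});
	}
	return frequencies;
}

// From the low E string (29 semitones below concert A) up to concert A itself.
var test_frequencies = build_test_frequencies(-29, 0);
```

Keeping the semitone and detune alongside each frequency means that, once a dominant frequency is found, the note and the tuning direction can be read off directly.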

Implementation notes

The source code of my implementation is available on GitHub, and a usable version is online at https://jbergknoff.github.io/guitar-tuner. Rather than belaboring all the boring details, I will take a moment here to comment on some of the interesting parts.

I initially tried capturing audio with the built-in AnalyserNode, but I had issues detecting low frequencies with it. At the time of writing, the longest stretch that the AnalyserNode can record at once is 2048 samples (about 46 milliseconds at the usual 44100 samples/second). The thick string of a guitar is typically tuned to an E note (around 82 Hz), and fewer than four periods of such a wave fit in that time window. A longer time window should make it easier to pick up low frequencies. I implemented a scheme to record for longer using the ScriptProcessor audio node. Now, detection of low frequencies is passable but still not great: the tuner often picks up an overtone instead of the fundamental. On the low E string it reports a B, and on the A string an E, which in each case is the third harmonic (an octave and a fifth above the fundamental). I am sure this aspect of the tuner could be improved if more effort were invested.
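The idea of recording for longer can be sketched as a small accumulator that collects the script processor's fixed-size chunks until enough samples are available. This is an illustration under assumptions, not the tuner's actual code: make_recorder and the target length of 8192 samples are hypothetical.

```javascript
// Collect fixed-size chunks until `target_length` samples are available.
// `target_length` is a hypothetical parameter; the article does not state
// the recording length used by the tuner.
function make_recorder(target_length)
{
	var buffer = [];
	return {
		// Append one chunk (e.g. a Float32Array). Returns the full recording
		// once enough samples have arrived, and null otherwise.
		"push": function(chunk)
		{
			// Copy the typed array's contents into the ordinary array.
			buffer = buffer.concat(Array.prototype.slice.call(chunk));
			if (buffer.length < target_length)
				return null;
			var recording = buffer.slice(0, target_length);
			buffer = []; // start over for the next recording
			return recording;
		}
	};
}

// In a browser, each chunk would come from the script processor, e.g.
//   script_processor.onaudioprocess = function(event)
//   {
//       var recording = recorder.push(event.inputBuffer.getChannelData(0));
//       if (recording) { /* hand it off for analysis */ }
//   };
var recorder = make_recorder(8192);
```

At 44100 samples/second, 8192 samples last about 186 milliseconds, which is roughly 15 periods of an 82 Hz wave.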

Even if I were to use the AnalyserNode for recording, I would not use its built-in Fourier transform because I am following the pseudo-Fourier procedure outlined in the previous section.
Rather than computing correlations with sine or with cosine, I use a complex exponential $e^{i x} = \cos x + i \sin x$ (i.e. compute with both sine and cosine, and then combine the results in a sensible way). This allows us to find correlations even when the waveform is out of phase with a pure sine or cosine waveform.
I use a Worker for computing the correlations because the computation is slow and I don’t want the UI thread to be unresponsive.
The script processor window size, 1024, was chosen somewhat haphazardly. A bad value can cause buffer underflow or overflow, but this one seemed to work okay for me. Tangentially related: there is an open Chromium issue regarding script processors with small window size misbehaving on Linux.
I intentionally leak the capture_audio function into the window namespace in order to circumvent a bug in Chromium which makes the script processor stop working after a few seconds.
I configure the script processor to play its output on the speakers, but never supply any output. This is a workaround for another bug in Chromium. Without it, the script processor never raises audioprocess events.
The Web Audio API deals in typed arrays (e.g. Float32Array) rather than normal JavaScript arrays. In order to convert a typed array to a standard array, you can call Array.prototype.slice on it, e.g. Array.prototype.slice.call(typed_array).
I wait 250 milliseconds between periods of consuming the microphone data. From the user’s perspective, getting output from the tuner four times per second is adequate, so there’s no sense in constantly spending the CPU time.
I identify the dominant frequency by just picking the one with maximum amplitude. This is simplistic but works reasonably well. I compare the magnitude of this dominant frequency to the average magnitude across all tested frequencies. If it’s not substantially stronger than average, then I ignore it.
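Two of the notes above, the complex exponential and the choice of dominant frequency, can be sketched together. The code below assumes that combining the sine and cosine results “in a sensible way” means taking the magnitude of the complex correlation. The function names and the threshold of 3 times the average are also assumptions for illustration, not values from the tuner's source.

```javascript
// Magnitude of the correlation between `samples` and a complex exponential
// at `frequency`. Using both cosine and sine makes the result insensitive
// to the phase of the incoming wave.
function correlation_magnitude(samples, sample_rate, frequency)
{
	var real = 0, imaginary = 0;
	for (var i = 0; i < samples.length; i++)
	{
		var angle = 2 * Math.PI * frequency * i / sample_rate;
		real += samples[i] * Math.cos(angle);
		imaginary += samples[i] * Math.sin(angle);
	}
	return Math.sqrt(real * real + imaginary * imaginary);
}

// Return the strongest test frequency, or null if it does not stand out.
// `threshold` (how many times stronger than average) is an assumed value.
function find_dominant(samples, sample_rate, frequencies, threshold)
{
	threshold = threshold || 3;
	var magnitudes = frequencies.map(function(f)
	{
		return correlation_magnitude(samples, sample_rate, f);
	});
	var total = 0, best = 0;
	for (var i = 0; i < magnitudes.length; i++)
	{
		total += magnitudes[i];
		if (magnitudes[i] > magnitudes[best])
			best = i;
	}
	var average = total / magnitudes.length;
	// Ignore the peak unless it is substantially stronger than average.
	return magnitudes[best] > threshold * average ? frequencies[best] : null;
}
```

Because find_dominant returns null for silence or for a signal with no clear peak, the caller can simply skip updating the display in those cases.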

Personally, I have no use for a tuner like the one built above. I am happy just having a reference tone play and tuning one string of my guitar to it. Because the Web Audio API makes it so easy, I added a button to play an E tone.

I control the OscillatorNode by piping it through a GainNode and setting the gain to zero or non-zero. The OscillatorNode has start and stop methods, but they are apparently not intended for toggling it on and off (it didn’t work when I tried, but I didn’t look very much into this).
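The gain-based toggle might look something like the sketch below. It takes the audio context as a parameter so that the wiring is easy to follow. The function name make_toggle_tone, the gain level of 0.5, and the example frequency of 329.63 Hz (the high E string) are assumptions for illustration, since the article does not say which E the button plays.

```javascript
// Build a tone that can be switched on and off by changing a gain value.
// `audio_context` is expected to be a Web Audio AudioContext.
function make_toggle_tone(audio_context, frequency)
{
	var oscillator = audio_context.createOscillator();
	var gain = audio_context.createGain();

	oscillator.frequency.value = frequency;
	gain.gain.value = 0; // start silent

	// Wire up the chain: oscillator -> gain -> speakers.
	oscillator.connect(gain);
	gain.connect(audio_context.destination);

	// Start once and leave it running; only the gain changes afterwards.
	oscillator.start();

	return {
		// Flip between silent and audible; returns true when the tone is on.
		"toggle": function()
		{
			gain.gain.value = gain.gain.value === 0 ? 0.5 : 0;
			return gain.gain.value !== 0;
		}
	};
}

// In a browser (the button variable is hypothetical):
//   var tone = make_toggle_tone(new AudioContext(), 329.63);
//   button.onclick = tone.toggle;
```

This matches the behavior described above: an OscillatorNode can only be started once, so leaving it running and muting it through the gain avoids having to create a new oscillator for every press.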

Summing up

This is a neat, simple project that touches on math, physics, music and programming. The final product is very basic, but it is reasonably useful. I hope this article inspires others to experiment with audio on the web.