Following our initial project to deliver a visitor display that allowed the public to experience elephants’ communication noises, we are working with the Zoological Society of London’s (ZSL) Whipsnade Zoo on an exciting project to develop an automatic elephant call identification system using audio data and machine learning.
Elephants are social creatures that communicate with each other using many kinds of vocalisations. The frequencies of most of their calls are within human hearing range, but some of their vocalisations are at frequencies below 20Hz (i.e. infrasound), the lower limit of human hearing, and are therefore inaudible to people. The range of the elephants’ infrasound vocalisations can extend for several kilometres, enabling them to communicate over long distances.
Machine learning methods have been widely used to train algorithms that identify different bird species in wildlife settings, using audio captured by microphones placed in the field (e.g. forests, parks, and urban environments). Much research has also been undertaken on the automatic detection of whales from audio signals. This kind of automatic identification greatly helps ecologists in conservation and monitoring work by saving them hundreds of hours of manual audio clip labelling. To the best of our knowledge, ours is the only project to date that aims to develop an automatic elephant call identification system from audio. In addition, no elephant sound dataset currently exists that is large enough to train machine learning algorithms on. Considering that elephants are highly social creatures, we are confident that developing such a system would be of great benefit, both to shed more light on elephants’ mysterious lives and to help conservation efforts.
As part of the visitor display system, which visualises the live elephant communications, a microphone and camera were installed in the elephant enclosure to record data 24/7. After a few months, thousands of hours of raw audio data were collected, enough to start development of the machine learning algorithm.
The first step in our project is to identify “points of interest” in the audio. We define a point of interest as an audio section that is noticeably louder than most sections of an audio interval. In our case, these points of interest consist of elephant sounds, as well as other noise from the enclosure such as people talking, machinery, and objects being moved around.
To create the points of interest, we first divide our raw audio into one-minute clips. Then, we create a grayscale spectrogram of each one-minute clip. A spectrogram is a visual representation of a signal, with time on the x-axis and frequency on the y-axis. The energy of a given frequency at a given time is represented by the intensity of the colour at the corresponding point, where darker colours mean higher energies. After creating the spectrogram image, we limit the maximum frequency shown to 8kHz, since we are not interested in capturing sounds at higher frequencies. Then, we apply Gaussian blurring to the image to reduce noise. Next, we apply “median clipping”: we set a pixel’s value to 1 if its energy is higher than 4 times the median energy of its corresponding row and column, and 0 otherwise. This transforms our grayscale spectrogram into a binary spectrogram. We then apply “binary closing” to the resulting image, which fills in small holes in objects, followed by “binary dilation”, which enlarges the objects slightly. Finally, we apply a median filter to remove salt-and-pepper noise and discard objects smaller than 2,500 contiguous pixels. The resulting spectrogram gives us our points of interest: sounds that have significant energy across a noteworthy range of frequencies and time. The figure below shows a spectrogram demonstrating these operations.

Spectrogram Analysis
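
As a rough sketch, the segmentation pipeline described above could be implemented with librosa, SciPy, and scikit-image along the following lines. Only the 8kHz limit, the 4x median threshold, and the 2,500-pixel minimum object size come from the description above; the FFT size, hop length, blur sigma, and structuring elements are illustrative assumptions.

```python
import numpy as np
import librosa
from scipy import ndimage
from skimage import morphology

def segment_mask(path, sr=16000, n_fft=1024, hop_length=512):
    """Binary 'points of interest' mask for a one-minute clip.

    Resampling to 16 kHz keeps only frequencies up to 8 kHz
    (the Nyquist limit), matching the cut-off described above.
    """
    y, sr = librosa.load(path, sr=sr)            # mono, resampled to 16 kHz
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))

    # Gaussian blurring to reduce noise (sigma is an illustrative choice)
    S = ndimage.gaussian_filter(S, sigma=1.5)

    # Median clipping: keep pixels > 4x the median of their row and column
    row_med = np.median(S, axis=1, keepdims=True)
    col_med = np.median(S, axis=0, keepdims=True)
    mask = (S > 4 * row_med) & (S > 4 * col_med)

    # Binary closing fills small holes; dilation slightly enlarges objects
    mask = ndimage.binary_closing(mask, structure=np.ones((3, 3)))
    mask = ndimage.binary_dilation(mask, structure=np.ones((3, 3)))

    # Median filter removes salt-and-pepper noise, then drop objects
    # smaller than 2,500 contiguous pixels
    mask = ndimage.median_filter(mask.astype(np.uint8), size=3).astype(bool)
    mask = morphology.remove_small_objects(mask, min_size=2500)
    return y, mask, hop_length, sr
```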
After creating our points of interest, or “segments,” we draw bounding boxes around them in our original spectrogram in order to find the timestamps where the segments start and end. The figure below illustrates the bounding boxes around the segments found in the figure above.

Segmented Spectrogram
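
One possible way to turn the binary mask into start and end timestamps is to label its connected components and read off the bounding box of each one, as sketched below; this builds on the hypothetical `segment_mask` helper above rather than on the project’s actual code.

```python
import librosa
from skimage import measure

def segment_times(mask, sr, hop_length):
    """Bounding boxes around connected regions of the binary mask,
    converted to (start, end) times in seconds within the clip."""
    labelled = measure.label(mask)               # connected components
    intervals = []
    for region in measure.regionprops(labelled):
        _, min_col, _, max_col = region.bbox     # columns are STFT frames
        start, end = librosa.frames_to_time(
            [min_col, max_col], sr=sr, hop_length=hop_length)
        intervals.append((start, end))
    return sorted(intervals)
```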
The next step in our pre-processing is to extract the audio intervals that correspond to these segments from our raw audio files. From a month and a half’s worth of recordings, we extracted approximately 50,000 segments with a median length of ~1.2 seconds, thus significantly reducing the total hours of recordings we need to process in order to extract all elephant sounds present.
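
Given those timestamps, cutting the segments out of the raw audio is then a matter of slicing the waveform and writing each slice to disk; the snippet below is an illustrative sketch using the `soundfile` library, which is our choice here rather than a detail of the project.

```python
import soundfile as sf

def export_segments(y, sr, intervals, out_prefix):
    """Cut each (start, end) interval out of the raw audio and save it."""
    for i, (start, end) in enumerate(intervals):
        clip = y[int(start * sr):int(end * sr)]
        sf.write(f"{out_prefix}_{i:04d}.wav", clip, sr)
```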
Now that we have our audio segments, it is time for labelling. After manually labelling ~10,000 segments, we found that ~400 of them correspond to elephant sounds, whereas the rest correspond to a wide variety of noises including construction noise, people talking, rain, and colliding objects.
Next, we produce features that represent the properties of each segment for training our algorithms. For each segment, we calculate Mel-frequency cepstral coefficients, root-mean-square energy, spectral roll-off, zero-crossing rate, and spectral contrast features, as well as the delta values of each of these.
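
With librosa, this feature extraction step might look roughly as follows. The number of MFCCs and the mean/standard-deviation aggregation over frames are illustrative assumptions, not details given above.

```python
import numpy as np
import librosa

def segment_features(y, sr):
    """Fixed-length feature vector for one audio segment.

    Each frame-level feature and its delta are summarised by their
    mean and standard deviation over the segment.
    """
    feats = [
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),
        librosa.feature.rms(y=y),
        librosa.feature.spectral_rolloff(y=y, sr=sr),
        librosa.feature.zero_crossing_rate(y),
        librosa.feature.spectral_contrast(y=y, sr=sr),
    ]
    vector = []
    for f in feats:
        d = librosa.feature.delta(f)             # delta (first derivative)
        for x in (f, d):
            vector.extend([x.mean(axis=1), x.std(axis=1)])
    return np.concatenate(vector)
```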
Finally, we train an extremely randomized trees (Extra-Trees) algorithm using all of the positive examples and 1,000 negative examples. We do not use all of the negative examples for training, since training on heavily imbalanced classes skews the algorithm’s predictions towards the class with the largest number of examples. After training, our current preliminary model produces a true positive rate of 86% and a false positive rate of 3% on a test set of 218 samples (38 of them positive).
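
A minimal sketch of this training and evaluation step, assuming feature matrices `X`, `y` for training and `X_test`, `y_test` for the held-out test set, using scikit-learn’s ExtraTreesClassifier; the number of trees and the random seed are arbitrary choices, and only the subsampling to 1,000 negatives reflects the text above.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import confusion_matrix

def train_and_evaluate(X, y, X_test, y_test, n_negatives=1000, seed=0):
    """Extra-Trees classifier trained on all positives plus a
    subsample of negatives, evaluated by TPR/FPR on a test set."""
    rng = np.random.default_rng(seed)
    pos = np.where(y == 1)[0]                    # elephant-call segments
    neg = rng.choice(np.where(y == 0)[0], size=n_negatives, replace=False)
    idx = np.concatenate([pos, neg])

    clf = ExtraTreesClassifier(n_estimators=500, random_state=seed)
    clf.fit(X[idx], y[idx])

    tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test)).ravel()
    return clf, tp / (tp + fn), fp / (fp + tn)   # model, TPR, FPR
```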
In order to increase the number of positive examples, we are now exploring data augmentation methods. Data augmentation refers to the creation of synthetic data by applying various transformation techniques to the original data while preserving the semantic validity of the data points. We are experimenting with techniques such as time stretching, pitch shifting, and adding background noise to create these new synthetic data points from our original recordings. We are confident that training with these additional augmented audio clips will improve our prediction performance.
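
As an illustration, these kinds of transformations are available in librosa; the stretch rates, pitch steps, and noise level below are arbitrary example values rather than the project’s actual settings.

```python
import numpy as np
import librosa

def augment(y, sr, rng=None):
    """Create a few synthetic variants of one elephant-call segment."""
    if rng is None:
        rng = np.random.default_rng(0)
    variants = []
    for rate in (0.9, 1.1):                      # time stretching
        variants.append(librosa.effects.time_stretch(y, rate=rate))
    for steps in (-1, 1):                        # pitch shifting (semitones)
        variants.append(librosa.effects.pitch_shift(y, sr=sr, n_steps=steps))
    noise = rng.normal(scale=0.005 * np.abs(y).max(), size=len(y))
    variants.append(y + noise)                   # added background noise
    return variants
```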
The longer-term ambition of this exciting project is to attempt to interpret these calls, for example by associating them with particular activities and behaviours. This could even help improve the welfare of animals in captivity, for example by determining when an elephant is ill but is not displaying any visible symptoms.
This also has wider applications for industry. For example, we can apply machine learning algorithms to condition monitoring and predictive maintenance of industrial equipment. By analysing typical models of operation, we can use changes that fall outside normal parameters to provide an early indication of a potential fault, thus highlighting a maintenance requirement before the equipment fails, reducing downtime, and cutting costs.