When collecting any data on the Arduino, it won’t be very long before you need to calculate some statistics on that data. While statistics can be a pretty intense mathematical field, some very basic statistics such as calculating the mean and standard deviation can be invaluable for many applications.

Fortunately, these calculations are easy to make, and their usefulness extends beyond statistics. Data from a sensor is often unstable; touch sensing is a good example. Each individual value (data point) can vary quite a bit, making it difficult to make an accurate determination. By averaging these values, a better decision can be made.

In a similar vein, by calculating the standard deviation, you can assess the quality of the values obtained. A large deviation from the mean can indicate problems with your sensor.

## Design Challenges

Because statistics often require some extensive data collection, normal methods can be too memory intensive for small microcontrollers like the Arduino or ATtiny based designs. Normally each value is stored and, when all the data is collected, various calculations such as the mean and standard deviation are calculated on the data set. Storing this much data can easily overrun the small amount of SRAM available.

To solve this problem, we can keep running totals of the data, storing a single value per statistic instead of a large array of values. One total is used for the mean, and another is used for the standard deviation. This is how it’s done.

## Theory

### The Mean

The arithmetic mean is the most common way of averaging a data set. It is calculated by summing the data set and dividing by the number of points in that set. If we were collecting a fixed set of values (data points), then it would be simple to just store the sum by adding each value as it is gathered, which is how we will do it.

The wrinkle comes when we want to gather a running total. Put another way, we want the running average to cover only the latest N values collected. To do that, we need some means of sloughing off the oldest data point before we add the latest one. Since using a collection is ruled out, how do we do this for a running total?

The answer is quite simple once you see it. It is best illustrated with an example. Let’s say we are summing 10 values at any given time. We have collected 10 values, and now need to add an 11th. The sum is 540, which makes the mean equal to 54. To make room for the latest value, we will simply subtract the mean from the total, leaving 486. While the mean is in all likelihood not the oldest value, it is the best representation of it. After all, the mean is what we are after, so that is what we will subtract.

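To put it into the simplest equation, we have the following:

$$\text{total}_{\text{new}} = \text{total} \cdot \frac{N - 1}{N} + \text{value}$$

where N is the number of samples.

In our example, this equation is expressed as:

$$\text{total}_{\text{new}} = 540 \cdot \frac{9}{10} + \text{value} = 486 + \text{value}$$

for our 10 data point total.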

There is one last wrinkle to consider. Until we have gathered at least N values, we don’t want to discard any collected data. Therefore, we will keep track of how many samples we have collected so far. While that count is less than N, we simply add the value to the total and increment the count. Once it reaches N, we perform the previous calculation before adding the value:

    if (currNumSamples >= N)
        total = total * (N - 1) / N;   // make room by removing one mean's worth
    else
        currNumSamples++;              // increment the current number
    total += value;
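The update above can be wrapped in a small helper to check the numbers from the example. This is only a minimal sketch, not the library's code: the `RunningMean` struct and its member names are made up for illustration, and it is plain C++ so it can be tried on a desktop compiler as well as on an Arduino.

```cpp
// Illustrative sketch only: RunningMean and its members are hypothetical
// names, not part of the library described in this article.
struct RunningMean {
    int   N;               // window size (number of samples to average over)
    int   currNumSamples;  // samples added so far, capped at N
    float total;           // running total, worth at most N samples

    explicit RunningMean(int numSamples)
        : N(numSamples), currNumSamples(0), total(0.0f) {}

    void add(float value) {
        if (currNumSamples >= N)
            total = total * (N - 1) / N;  // drop one mean's worth to make room
        else
            currNumSamples++;             // still filling up: discard nothing
        total += value;
    }

    // Mean of the samples held so far (0 if nothing has been added yet).
    float mean() const {
        return currNumSamples > 0 ? total / currNumSamples : 0.0f;
    }
};
```

Feeding it ten readings of 54 gives a total of 540. Adding an eleventh reading of 64 then leaves a total of 486 + 64 = 550 and a mean of 55.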

### The Standard Deviation

Now that we have stored the data necessary to calculate the mean, the next challenge is the standard deviation. The standard deviation is a useful measure of how closely grouped our data is around the mean. It is the square root of the variance, which is the value we will gather.

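The variance is the average of the squared differences between the values and the mean:

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2$$

where $\bar{x}$ is the mean.

Since the variance is built from a sum, it looks like it might be possible to store a single total just like we did for the mean. The problem is that we won’t have the mean until all the data is gathered.

The solution is to use a basic arithmetic property that we learned in grade school – the *distributive property*. Specifically, for each $x_i$, we can expand the square and write equivalently:

$$(x_i - \bar{x})^2 = x_i^2 - 2 x_i \bar{x} + \bar{x}^2$$

Substituting into our sum, we can write it as:

$$\sum_{i=1}^{N} (x_i - \bar{x})^2 = \sum_{i=1}^{N} x_i^2 + N \bar{x}^2 - 2 \bar{x} \sum_{i=1}^{N} x_i$$

If we look carefully at the last sum, $\sum_{i=1}^{N} x_i$, we see that we can simply substitute $N \bar{x}$ for it, which reduces the equation to:

$$\sum_{i=1}^{N} (x_i - \bar{x})^2 = \sum_{i=1}^{N} x_i^2 + N \bar{x}^2 - 2 N \bar{x}^2$$

Simplifying we have:

$$\sum_{i=1}^{N} (x_i - \bar{x})^2 = \sum_{i=1}^{N} x_i^2 - N \bar{x}^2$$

Putting it all together, we calculate our running total of squared values as:

    varianceSum = varianceSum * (N - 1) / N;   // once N samples are held, as for the mean
    varianceSum += sqr(val);                   // sqr(val) is val * val

and calculate the variance as:

$$\sigma^2 = \frac{\text{varianceSum} - N \bar{x}^2}{N} = \frac{\text{varianceSum}}{N} - \bar{x}^2$$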

Lastly, we can calculate the standard deviation by getting the square root of the variance.
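The identity behind this derivation, that the sum of squared deviations equals the sum of the squared values minus N times the squared mean, can be checked on a small fixed data set. The sketch below compares the usual two-pass calculation with the two-totals calculation. The function names are hypothetical, and dividing by N (the population variance) is an assumption of this sketch; dividing by N - 1 would give the sample variance instead.

```cpp
#include <cmath>    // std::sqrt
#include <cstddef>  // std::size_t

// Illustrative helpers; the names are hypothetical and not taken from the library.

// Two-pass variance: needs the whole data set in memory.
float varianceTwoPass(const float* data, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; i++) sum += data[i];
    const float mean = sum / n;

    float sumSqDiff = 0.0f;
    for (std::size_t i = 0; i < n; i++) {
        const float d = data[i] - mean;
        sumSqDiff += d * d;
    }
    return sumSqDiff / n;  // assumption: population variance (divide by n)
}

// One-pass variance: only two totals are kept while reading the data.
float varianceFromTotals(const float* data, std::size_t n) {
    float total = 0.0f;        // sum of the values
    float varianceSum = 0.0f;  // sum of the squared values
    for (std::size_t i = 0; i < n; i++) {
        total += data[i];
        varianceSum += data[i] * data[i];
    }
    const float mean = total / n;
    return (varianceSum - n * mean * mean) / n;
}

// Standard deviation is the square root of the variance.
float stdDeviationFromTotals(const float* data, std::size_t n) {
    return std::sqrt(varianceFromTotals(data, n));
}
```

For the data set 2, 4, 4, 4, 5, 5, 7, 9 both functions give a variance of 4, and the standard deviation is 2. One caveat worth knowing: when the mean is large compared with the spread of the data, subtracting two nearly equal totals can lose precision in single-precision floats.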

## The Library

Putting it all together into a handy class, it would look like the following:

    class Statistics {
    public:
        Statistics(int numSamples);

        void addData(float val);

        float mean() const;
        float variance() const;
        float stdDeviation() const;
    };
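For illustration, here is one way the interface above could be filled in using the two running totals from the Theory section. It is a sketch under stated assumptions rather than the library's actual source: the class name `StatisticsSketch`, the private members, and the choice to divide by the current sample count are all my own.

```cpp
#include <cmath>  // std::sqrt

// One possible implementation of the interface above. This is a sketch, not
// the library's source: the class name, the private members and the
// divide-by-count variance are assumptions.
class StatisticsSketch {
public:
    explicit StatisticsSketch(int numSamples)
        : N(numSamples), currNumSamples(0), total(0.0f), varianceSum(0.0f) {}

    void addData(float val) {
        if (currNumSamples >= N) {
            // Window is full: remove one "average" sample from each total.
            total = total * (N - 1) / N;
            varianceSum = varianceSum * (N - 1) / N;
        } else {
            currNumSamples++;
        }
        total += val;
        varianceSum += val * val;
    }

    float mean() const {
        return currNumSamples > 0 ? total / currNumSamples : 0.0f;
    }

    float variance() const {
        if (currNumSamples == 0) return 0.0f;
        const float m = mean();
        const float v = varianceSum / currNumSamples - m * m;
        return v > 0.0f ? v : 0.0f;  // guard against tiny negative rounding errors
    }

    float stdDeviation() const { return std::sqrt(variance()); }

private:
    int   N;               // window size
    int   currNumSamples;  // samples added so far, capped at N
    float total;           // running total of the values
    float varianceSum;     // running total of the squared values
};
```

Whatever the window size, the object holds two ints and two floats, which is the point of the running-total approach.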

### Using It

Using the class is simple:

    Statistics stats(10); // 10 data points to be collected in a running fashion

    void loop()
    {
        stats.addData(digitalRead(pin));

        // some more code ...

        float aveVal = stats.mean();
        float quality = stats.stdDeviation();
    }

The main decision to be made when using this class is how many data points to collect at a time. Since the memory storage is constant regardless of N, one might think to use a large N. In some cases that is acceptable. In others, there is a penalty for doing so. If for example we are trying to smooth some jittery data, the more points collected, the more stable the running mean. This stability also means that our response time to changing circumstances will be much slower.

It is similar to a low pass filter on your data. You can smooth out the steady state as much as you like, but at the cost of a longer settling time in the transient. The best way to choose the sample size is to test different sizes in your application and pick the one that best manages the compromise between stability and responsiveness.
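The trade-off can be made concrete with a small experiment. Once N samples have been collected, the update rule is algebraically the same as a first-order exponential moving average with a smoothing factor of 1/N, so the settling time grows roughly in proportion to N. The sketch below is hypothetical (the function name and the 90% threshold are arbitrary choices): it starts from a full window of readings at 0, feeds in a constant 1.0, and counts the samples needed for the running mean to reach 0.9.

```cpp
// Illustrative experiment; the function name and the 90% threshold are
// arbitrary choices for this sketch.
// Starts from a full window of readings at 0, then feeds a constant 1.0 and
// counts how many samples it takes for the running mean to reach 0.9.
int samplesToSettle(int N) {
    float total = 0.0f;  // a full window of zeros sums to zero
    int count = 0;
    while (total / N < 0.9f) {
        total = total * (N - 1) / N;  // drop one mean's worth
        total += 1.0f;                // add the new reading
        count++;
    }
    return count;
}
```

With N = 10 the mean needs 22 samples to get within 10% of the new level, and with N = 100 it needs roughly ten times as many. If your sensor is read every 10 ms, that is the difference between reacting in about a quarter of a second and reacting in a couple of seconds.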

### Other considerations

One last consideration is the fact that the inclusion of floating point math consumes over 200 bytes of SRAM in overhead. For many chips, particularly ATtiny chips or other micros with small amounts of SRAM, this overhead is too much.

All is not lost. The same library can be implemented using integer arithmetic as well.

This last point raises another prospect – can the library be implemented as a C++ template and then instantiated using the data store type of choice? It could, but usually applications fall into two categories – ones where the extra 200 or so bytes of RAM is not a problem, or ones where every last byte is crucial. Therefore, two versions of the library are implemented – one for floating point and the other using integer arithmetic.
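To give a feel for what the integer approach involves, here is a minimal sketch of an integer-only running mean. It is not the library's integer implementation: the `IntRunningMean` name, the member names and the choice of a 32-bit total are assumptions made for illustration.

```cpp
#include <stdint.h>  // int16_t, int32_t

// Illustrative integer-only running mean; names are hypothetical and this is
// not the library's integer implementation.
struct IntRunningMean {
    int16_t N;               // window size
    int16_t currNumSamples;  // samples added so far, capped at N
    int32_t total;           // 32-bit total so total * (N - 1) has headroom

    explicit IntRunningMean(int16_t numSamples)
        : N(numSamples), currNumSamples(0), total(0) {}

    void add(int16_t value) {
        if (currNumSamples >= N)
            total = total * (N - 1) / N;  // integer division truncates toward zero
        else
            currNumSamples++;
        total += value;
    }

    int16_t mean() const {
        return currNumSamples > 0
                   ? static_cast<int16_t>(total / currNumSamples)
                   : 0;
    }
};
```

Two things differ from the floating point version. The division truncates, so a small fraction is lost on each update, and the intermediate product `total * (N - 1)` must fit in the total's type. With a 32-bit total and 10-bit readings (0 to 1023), that still leaves room for windows of over a thousand samples.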

One last point – in the floating point case, should floats or doubles be used? Both cover a far larger range than sensor readings require, so either will work. Double precision provides only that advantage – precision. Since we are working with statistical quantities, precision is the last thing needed. Therefore, simple floats will suffice. (On 8-bit AVR boards such as the Uno, `double` is the same 4-byte type as `float` anyway.)

## Where to Get It

Both libraries are available at GitHub. Each library is completely implemented in just a header file. These libraries are small, fast and efficient. In addition to the mean, variance and standard deviation, the total, min, and max data values can also be retrieved.

## Conclusion

I have described a basic statistical library for the Arduino and other microcontrollers. It performs basic statistical calculations and is memory efficient.

Not only can it be used for gathering useful statistical information about sensor data, it can also be used to clean up dirty data such as that which comes from touch sensors. Constructed from a single class with only one member function used to store the data, it is easy and fun to use.

Please offer your suggestions and report bugs in the comments, and I will try to incorporate any improvements to the library. Alternatively, feel free to fork the library on GitHub and submit pull requests for any improvements you have made.

Mathematical expressions are rendered courtesy of QuickLaTeX.com

Statistics on the Arduino (also Pic or any microcontroller) by Provide Your Own is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
