When collecting any data on the Arduino, it won’t be very long before you need to calculate some statistics on that data. While statistics can be a pretty intense mathematical field, some very basic statistics such as calculating the mean and standard deviation can be invaluable for many applications.

Fortunately, these calculations are not only easy to make, but their usefulness extends beyond just statistics. Data from a sensor is often unstable; touch sensing is a good example. Each individual value (data point) can vary quite a bit, making it difficult to make an accurate determination. By averaging these values, a better decision can be made.

In a similar vein, by calculating the standard deviation, you can assess the quality of the values obtained. A large deviation from the mean can indicate problems with your sensor.

## Design Challenges

Because statistics often require extensive data collection, the usual methods can be too memory-intensive for small microcontrollers such as Arduino or ATtiny-based designs. Normally each value is stored and, once all the data is collected, quantities such as the mean and standard deviation are computed over the whole data set. Storing this much data can easily overrun the small amount of SRAM available.

To solve this problem, we can keep running totals of the data, storing a single value per total instead of a large array of values. One total is used for the mean, and another is used for the standard deviation. This is how it’s done.

## Theory

### The Mean

The arithmetic mean is the most common way of averaging a data set. It is calculated by summing the data set and dividing by the number of points in that set. If we were collecting a fixed set of values (data points), then it would be simple to just store the sum by adding each value as it is gathered, which is how we will do it.

The wrinkle comes when we want to gather a running total. Put another way, we do not want to store N values, but we do want to know the running average of the latest N values collected. To do that, we need some means of sloughing off the oldest data point before we add the latest one. Since using a collection is ruled out, how do we do this for a running total?

The answer is quite simple once you see it, and it is best illustrated with an example. Let’s say we are summing 10 values at any given time. We have collected 10 values, and now need to add an 11th. The sum is 540, which makes the mean equal to 54. To make room for the latest value, we simply subtract the mean from the total. While the mean is in all likelihood not the oldest value, it is the best representation of it. After all, the mean is what we are after, so that is what we will subtract.

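To put it into the simplest equation, we have the following:

$$\text{total}_{\text{new}} = \text{total} - \frac{\text{total}}{N} + x_{\text{new}} = \text{total} \cdot \frac{N - 1}{N} + x_{\text{new}}$$

where N is the number of samples.

In our example, this equation is expressed as:

$$\text{total}_{\text{new}} = 540 \cdot \frac{9}{10} + x_{11} = 486 + x_{11}$$

for our 10 data point total.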

There is one last wrinkle to consider. Until we have gathered at least N values, we don’t want to discard any collected data. Therefore, we keep track of how many samples we have collected so far. While the current number of samples is less than N, we simply add the value to the total. Once it reaches N, we perform the previous calculation before adding:

    if (currNumSamples >= N)
        total = total * (N - 1) / N;
    else
        currNumSamples++;  // increment the current number
    total += value;
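The update rule above can be wrapped in a small self-contained type. This is only a sketch in plain C++ (so it can be checked off the board), not the library's actual source; the name `RunningMean` and its members are illustrative assumptions.

```cpp
// Running mean without a sample buffer.
// Sketch only: the struct and member names are assumptions for illustration.
struct RunningMean {
    float total;           // running total, stays near N * mean once full
    int   currNumSamples;  // samples seen so far, capped at N
    int   N;               // nominal window length

    explicit RunningMean(int windowLength)
        : total(0.0f), currNumSamples(0), N(windowLength) {}

    void addData(float value) {
        if (currNumSamples >= N)
            total = total * (N - 1) / N;  // remove one "average" sample
        else
            currNumSamples++;             // still filling the window
        total += value;
    }

    float mean() const {
        // Divide by the samples actually collected while the window fills.
        return currNumSamples > 0 ? total / currNumSamples : 0.0f;
    }
};
```

Because the value removed is the mean rather than the true oldest sample, old data fades out gradually instead of dropping out at once. With a window of 2, feeding 10, 10, 10, 20, 20, 20 gives a mean of 18.75 rather than 20, and it keeps approaching 20 as more 20s arrive.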

### The Standard Deviation

Now that we have stored the data necessary to calculate the mean, the next challenge is the standard deviation. The standard deviation is a useful measure of how closely grouped our data is around the mean. It is the square root of the variance, which is the value we will gather.

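The variance is the average of the squared differences between the values and the mean:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$$

where $\bar{x}$ is the mean.

Since the variance is built from a sum, it looks like it might be possible to store a single total just like we did for the mean. The problem is that we won’t have the mean until all the data is gathered.

The solution is to use a basic arithmetic property that we learned in grade school – the *distributive property*. Specifically, for each $(x_i - \bar{x})^2$, we can write equivalently:

$$(x_i - \bar{x})^2 = x_i^2 - 2x_i\bar{x} + \bar{x}^2$$

Substituting into our sum, we can write it as:

$$\sum_{i=1}^{N}(x_i - \bar{x})^2 = \sum_{i=1}^{N}x_i^2 + N\bar{x}^2 - 2\bar{x}\sum_{i=1}^{N}x_i$$

If we look carefully at the last sum, $\sum_{i=1}^{N}x_i$, we see that we can simply substitute $N\bar{x}$ for it, which reduces the equation to:

$$\sum_{i=1}^{N}(x_i - \bar{x})^2 = \sum_{i=1}^{N}x_i^2 + N\bar{x}^2 - 2N\bar{x}^2$$

Simplifying we have:

$$\sum_{i=1}^{N}(x_i - \bar{x})^2 = \sum_{i=1}^{N}x_i^2 - N\bar{x}^2$$

Putting it all together, we calculate our running total as:

    varianceSum = varianceSum * (N - 1) / N;
    varianceSum += sqr(val);  // sqr(val) is val * val

and calculate the variance as:

$$\sigma^2 = \frac{\text{varianceSum} - N\bar{x}^2}{N} = \frac{\text{varianceSum}}{N} - \bar{x}^2$$

Lastly, we can calculate the standard deviation by getting the square root of the variance.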
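As a rough illustration of the two running totals working together, here is a minimal sketch in plain C++. It is not the library's source. It makes two assumptions: the sum of squares is scaled only once N samples have been collected (mirroring the mean), and the variance divides by the sample count (the population form) rather than the count minus one. The name `RunningStats` is hypothetical.

```cpp
#include <math.h>

// Running mean, variance and standard deviation from two totals.
// Sketch only: names and the divide-by-count convention are assumptions.
struct RunningStats {
    float total;           // running sum of values
    float varianceSum;     // running sum of squared values
    int   currNumSamples;  // samples seen so far, capped at N
    int   N;               // nominal window length

    explicit RunningStats(int windowLength)
        : total(0.0f), varianceSum(0.0f), currNumSamples(0), N(windowLength) {}

    void addData(float val) {
        if (currNumSamples >= N) {
            total       = total * (N - 1) / N;
            varianceSum = varianceSum * (N - 1) / N;
        } else {
            currNumSamples++;
        }
        total       += val;
        varianceSum += val * val;
    }

    float mean() const {
        return currNumSamples > 0 ? total / currNumSamples : 0.0f;
    }

    float variance() const {
        if (currNumSamples == 0) return 0.0f;
        float m = mean();
        float v = varianceSum / currNumSamples - m * m;  // mean of squares minus squared mean
        return v > 0.0f ? v : 0.0f;  // clamp small negative rounding error
    }

    float stdDeviation() const { return sqrtf(variance()); }
};
```

Feeding the eight values 2, 4, 4, 4, 5, 5, 7, 9 into a window of 8 gives a mean of 5, a variance of 4 and a standard deviation of 2. The clamp in `variance()` is there because subtracting two nearly equal floats can leave a slightly negative result when the data is almost constant.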

## The Library

Putting it all together into a handy class, it would look like the following:

    class Statistics {
    public:
        Statistics(int numSamples);
        void addData(float val);
        float mean() const;
        float variance() const;
        float stdDeviation() const;
    };

### Using It

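Using the class is simple:

    #include <Statistics.h>

    const int pin = 2;     // input pin; the pin number here is only a placeholder

    Statistics stats(10);  // 10 data points to be collected in a running fashion

    void setup() {
        pinMode(pin, INPUT);
    }

    void loop() {
        stats.addData(digitalRead(pin));
        // some more code ...
        float aveVal  = stats.mean();
        float quality = stats.stdDeviation();
    }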

The main decision to be made when using this class is how many data points to collect at a time. Since the memory storage is constant regardless of N, one might think to use a large N. In some cases that is acceptable. In others, there is a penalty for doing so. If for example we are trying to smooth some jittery data, the more points collected, the more stable the running mean. This stability also means that our response time to changing circumstances will be much slower.

It is similar to a low pass filter on your data. You can smooth out the steady state as much as you like, but at a cost of a longer settling time in the transient. The best way to determine the best sample size is to test different sizes in your application and choose the one that best manages the compromise between stability and responsiveness.
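The trade-off can be made concrete. Once the window is full, the update rule is equivalent to `mean = mean + (value - mean) / N`, which is an exponential moving average with weight 1/N. The sketch below (plain C++, with a hypothetical helper name) counts how many samples the running mean needs to come within a given tolerance of a new level after the input steps up from zero.

```cpp
// Count the samples needed for the running mean to come within `tolerance`
// of `level` after the input steps from 0 to `level`.
// Sketch only: samplesToSettle is a hypothetical helper, not part of the library.
int samplesToSettle(int N, double level, double tolerance) {
    double total = 0.0;  // window assumed already full of zeros
    int count = 0;
    while (level - total / N > tolerance) {
        total = total * (N - 1) / N + level;  // same update as the running total
        count++;
    }
    return count;
}
```

For a step from 0 to 100 and a tolerance of 5, a window of 4 settles in 11 samples while a window of 32 takes 95. The larger window is far smoother but correspondingly slower to follow a real change.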

### Other considerations

One last consideration is the fact that the inclusion of floating point math consumes over 200 bytes of SRAM in overhead. For many chips, particularly ATtiny chips or other micros with small amounts of SRAM, this overhead is too much.

All is not lost. The same library can be implemented using integer arithmetic as well. This last point raises another prospect – can the library be implemented as a C++ template and then instantiated using the data store type of choice?

It could, but usually applications fall into two categories – ones where the extra 200 or so bytes of RAM is not a problem, or ones where every last byte is crucial. Therefore, two versions of the library are implemented – one for floating point and the other using integer arithmetic.
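To show what the template route might look like, here is a hedged sketch of the running mean parameterised on its storage type. It is not how the published libraries are organised (they are two separate headers); `RunningMeanT` and its members are hypothetical names.

```cpp
#include <stdint.h>

// Running mean parameterised on the accumulator type.
// Sketch only: class and member names are assumptions for illustration.
template <typename T>
class RunningMeanT {
public:
    explicit RunningMeanT(int numSamples)
        : total_(0), count_(0), n_(numSamples) {}

    void addData(T val) {
        if (count_ >= n_)
            total_ = total_ * (n_ - 1) / n_;  // integer types truncate here
        else
            count_++;
        total_ += val;
    }

    T mean() const { return count_ > 0 ? total_ / count_ : T(0); }

private:
    T   total_;  // must be wide enough to hold roughly N * largest value
    int count_;  // samples seen so far, capped at n_
    int n_;      // nominal window length
};
```

With a window of 2 and the inputs 10, 10, 10, 20, 20, 20, `RunningMeanT<float>` reports 18.75 while `RunningMeanT<int32_t>` reports 18, because each integer division discards the fraction. With integer storage the accumulator type also has to be chosen with overflow in mind, since the total sits near N times the mean, and a sum of squares near N times the mean squared.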

One last point – in the floating point case, should floats or doubles be used? Both cover a far wider range of values than sensor data needs, so either will work. The only advantage double precision offers is just that – precision. Since we are working with statistical quantities, extra precision is the last thing needed. Therefore, simple floats will suffice.

## Where to Get It

Both libraries are available at GitHub. Each library is completely implemented in just a header file. These libraries are small, fast and efficient. In addition to the mean, variance and standard deviation, the total, min, and max data values can also be retrieved.

## Conclusion

I have described a basic statistical library for the Arduino and other microcontrollers. It performs basic statistical calculations and is memory efficient.

Not only can it be used for gathering useful statistical information about sensor data, it can also be used to clean up dirty data such as that which comes from touch sensors. Constructed from a single class with only one member function used to store the data, it is easy and fun to use.

Please post your suggestions and bug reports in the comments, and I will try to incorporate any improvements to the library. Alternatively, feel free to fork the library on GitHub and issue pull requests for any improvements you have made.

Mathematical expressions are rendered courtesy of QuickLaTeX.com

## 11 Comments

Very Informative write up….. Appreciated

After reading through the integer version, I would describe this as a decent and definitely compact implementation. For well-behaved, size char mean and variance it should work.

Note that `m = m * ((n-1)/n)` will not actually discard old data. Instead, it scales it so that the value you expect to store remains approximately bounded by `N*mean` for the mean, and `N*(mean^2)` for the sum of squares for the variance. If you put in a window length of two, and try the sequence {10, 10, 10, 20, 20, 20} you will see that you get a final number of 15, not the 2-observation mean of 20. If you used two pairs of accumulators you could implement a window that is constrained between N and 2N, however, by switching accumulators whenever the count reaches N, clearing the “new” accumulator, and always using both together for output. This would give you the long-term memoryless property that you desire.

The “well behaved” warning is because of large N*value – particularly on readings that can vary widely. Overflow errors would wipe out the accuracy of all subsequent readings, and are reasonably likely to happen if there is respectable variance and auto-correlated errors.

Why would I notice all this? Why, by burning myself while making a rate adaptive Morse decoder of course.

Correcting myself on the windowing: you do get exponential decay of the effect of out-of-window values, and the results in my example are not 15… Not 20 either (depending on truncation/rounding algorithm, should be about 18), but close enough for horseshoes and hand grenades.

You raise some good points. The data is cumulative by design, and discarding the data and starting over would be one way of changing that behavior. It turns out that old data does not really hang around forever though, but is sloughed off over time. In the example you provide, the mean should be 18.75, not 15. As you continue to add more 20s to the accumulated data, the mean will approach 20. By choosing the appropriate window size you can cycle over to new data values as fast or slow as you would like. Put another way, if you want your running average to adapt quickly to new values, use a small sample size. If you want it to adapt more slowly, use a larger sample size. The sample size does not change the storage, only how many of the last samples are accumulated.

Sample size affects the effect of bad data as well. A small sample size will reflect the bad data more readily, but would also clear it out more quickly as well. A larger sample size keeps the bad data points around longer, but they will have less of an effect.

I do not think the AVR compiler supports doubles and just implements them as floats.

Hey – Just wanted to say thanks for making this library available – I found it through Hackaday and realized it would help me elegantly solve one of the issues with a sound-reactive project I’m working on.

Here’s a video of one of my experiments that is using your Statistics library to efficiently smooth out the readings from a spectrum analyzer IC to make a nice, flicker-free light show on a series of LEDs: http://youtu.be/t4MN1q9X0Us

Thanks again!

Jim

That is a great application of this library. I have always found the typical color organ somewhat annoying in that it seems to respond mainly to the bass beats. The use of averaging definitely gives it a more pleasing appearance IMHO. I like your project – thanks for sharing it.

Maybe a dumb question, but how can I run this library?

Should I copy the statistics folder to my libs folder in Arduino? I did that, but it does not work.

Here is my (your) code:

    #include
    #include

    Statistics stats(10);

    void setup()
    {
        Serial.begin(9600);
    }

    void loop()
    {
        int data = analogRead(A0);
        stats.addData(data);
        Serial.print("Mean: ");
        Serial.print(stats.mean());
        Serial.print(" Std Dev: ");
        Serial.println(stats.stdDeviation());
    }

Your include statements are blank. What kind of error are you getting? You need to provide more detail.

I renamed README.md to plp.ino

    plp:49: error: ‘Statistics’ does not name a type
    plp.ino: In function ‘void loop()’:
    plp:59: error: ‘stats’ was not declared in this scope

You need to include Statistics.h in your sketch. That includes the library. I have updated the readme and added examples in GitHub. Just do a pull and you’ll be all set.

## 5 Trackbacks

[...] of data on a small device. [Scott Daniels] has some help for you in this arena. He explains how to manage statistical calculations on your collected data without eating up all the RAM. The library which he made available is targeted for the Arduino. But the concepts, which he [...]

[...] Statistics on the Arduino [...]

[...] Lib from here [...]