Statistics on the Arduino (also PIC or any microcontroller)

Normal distribution curve (image courtesy of Wikipedia)

When collecting any data on the Arduino, it won’t be very long before you need to calculate some statistics on that data. While statistics can be a pretty intense mathematical field, some very basic statistics such as calculating the mean and standard deviation can be invaluable for many applications.

Fortunately, these calculations are not only easy to make, but their usefulness extends beyond statistics alone. Often the data from a sensor is not stable. Touch sensing is a good example: each individual value (data point) can vary quite a bit, making it difficult to reach an accurate determination. By averaging these values, a better decision can be made.

In a similar vein, by calculating the standard deviation, you can assess the quality of the values obtained. A large deviation from the mean can indicate problems with your sensor.

Design Challenges

Because statistics often require extensive data collection, the usual methods can be too memory intensive for small microcontrollers like the Arduino or ATtiny-based designs. Typically each value is stored and, once all the data is collected, quantities such as the mean and standard deviation are computed over the data set. Storing this much data can easily overrun the small amount of SRAM available.

To solve this problem, we can keep running totals of the data, storing a single value instead of a large array of values. One total is used for the mean, and another for the standard deviation. Here is how it’s done.

Theory

The Mean

The arithmetic mean is the most common way of averaging a data set. It is calculated by summing the data set and dividing by the number of points in that set. If we were collecting a fixed set of values (data points), it would be simple to store just the sum, adding each value as it is gathered – and that is in fact how we will do it.

The wrinkle comes when we want a running total. Put another way, we want to know the running average of the latest N values collected without storing all N of them. To do that, we need some means of sloughing off the oldest data point before we add the latest one. Since storing the whole collection is ruled out, how do we do this with a single running total?

The answer is quite simple once you see it, and is best illustrated with an example. Let’s say we are summing 10 values at any given time. We have collected 10 values and now need to add an 11th. The sum is 540, which makes the mean 54. To make room for the latest value, we simply subtract the mean from the total. While the mean is almost certainly not the oldest value, it is the best available representation of it. After all, the mean is what we are after, so that is what we subtract.

To put it into the simplest equation, we have the following:

Total = Total * (N - 1) / N

where N is the number of samples.

In our example, this equation is expressed as:

Total = Total * 9/10

for our 10 data point total.
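
To continue the earlier example, suppose the running total is 540 (a mean of 54) and the next reading happens to be 60 (a made-up value, purely for illustration):

Total = 540 * 9/10 = 486
Total = 486 + 60 = 546
Mean = 546 / 10 = 54.6

The new value nudges the mean upward, while the influence of the older data decays away over successive updates.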

There is one last wrinkle to consider. Until we have gathered at least N values, we don’t want to discard any collected data. Therefore, we keep track of how many samples have been collected so far. While that count is less than N, we simply add each value to the total. Once it reaches N, we perform the calculation above:

if (currNumSamples >= N)
  total = total * (N - 1) / N;
else
  currNumSamples++; // increment the current number

total += value;
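
With values accumulated this way, the mean at any moment is simply the running total divided by the number of samples gathered so far. A minimal sketch, using the same total, currNumSamples, and N as above (and assuming at least one value has been added):

float mean = total / currNumSamples;  // equals total / N once N samples have been gathered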

The Standard Deviation

Now that we have stored the data necessary to calculate the mean, the next challenge is the standard deviation. The standard deviation is a useful measure of how closely grouped our data is around the mean. It is the square root of the variance, which is the value we will gather.

The variance is the average of the squared differences between the values and the mean:

Variance = \frac{1}{N}\sum_{i=1}^{N}{\Delta P_i}

where \Delta P_i = (val_i - mean)^2.

Since the variance is built from a sum, it looks like it might be possible to keep a single running total just as we did for the mean. The problem is that we won’t have the mean until all the data has been gathered.

The solution is some basic algebra we learned in grade school – expanding the square. Specifically, each \Delta P_i can be written equivalently as:

(val_i - mean)^2 = val_i^2 - 2*val_i * mean + mean^2

Substituting this into the sum (and deferring the division by N until the end), we can write it as:

\sum_{i=1}^{N}{val_i^2} + N * mean^2 - 2*mean*\sum_{i=1}^{N}{val_i}

If we look carefully at the last sum, \sum_{i=1}^{N}{val_i}, we see that we can simply substitute N*mean for it (since the mean is, by definition, that sum divided by N), which reduces the expression to:

\sum_{i=1}^{N}{val_i^2} + N * mean^2 - 2*N*mean^2

Simplifying, we have:

\sum_{i=1}^{N}{val_i^2} - N*mean^2

Putting it all together, we calculate our running total as:

// As with the mean, skip the decay step until N samples have been gathered.
varianceSum = varianceSum * (N - 1) / N;  // slough off the oldest data, as before
varianceSum += val * val;                 // add the square of the newest value

and calculate the variance as:

variance = (varianceSum - N * mean^2) / N

Lastly, we calculate the standard deviation by taking the square root of the variance.
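
Translated directly into code, a sketch of the final calculation might look like this, using the same running variables as above (currNumSamples equals N once the window has filled) and guarding against rounding error nudging the variance slightly negative:

float mean     = total / currNumSamples;
float variance = (varianceSum - currNumSamples * mean * mean) / currNumSamples;
if (variance < 0) variance = 0;       // tiny negative values can appear from rounding
float stdDeviation = sqrt(variance);  // sqrt() comes with the Arduino core (math.h)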

The Library

Putting it all together into a handy class, the interface looks like the following:

class Statistics {
public:
  Statistics(int numSamples);

  void addData(float val);

  float mean() const;
  float variance() const;
  float stdDeviation() const;
};
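
The actual implementation lives in the header file on GitHub; as an illustration only, a minimal version built from the running totals described above might look roughly like this (member names such as _total and _squares are my own, not necessarily the library’s):

#include <math.h>  // for sqrt(); already pulled in by the Arduino core

class Statistics {
public:
  explicit Statistics(int numSamples)
    : _N(numSamples), _count(0), _total(0), _squares(0) {}

  void addData(float val) {
    if (_count >= _N) {
      // Window is full: slough off roughly the oldest sample from each total.
      _total   = _total   * (_N - 1) / _N;
      _squares = _squares * (_N - 1) / _N;
    } else {
      _count++;
    }
    _total   += val;
    _squares += val * val;
  }

  float mean() const { return _total / _count; }   // assumes at least one sample added

  float variance() const {
    float m = mean();
    float v = (_squares - _count * m * m) / _count;
    return (v > 0) ? v : 0;    // guard against a tiny negative value from rounding
  }

  float stdDeviation() const { return sqrt(variance()); }

private:
  int   _N;        // window size
  int   _count;    // samples accumulated so far, capped at _N
  float _total;    // running sum of the values
  float _squares;  // running sum of the squared values
};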

Using It

Using the class is simple:

Statistics stats(10); // 10 data points to be collected in a running fashion

void loop() 
{
  stats.addData(digitalRead(pin));

  // some more code
  ...

  float aveVal = stats.mean();
  float quality = stats.stdDeviation();
}

The main decision to be made when using this class is how many data points to collect at a time. Since the memory usage is constant regardless of N, one might be tempted to use a large N. In some cases that is acceptable; in others, there is a penalty for doing so. If, for example, we are trying to smooth some jittery data, the more points collected, the more stable the running mean. That stability also means our response to changing conditions will be much slower.

It is similar to applying a low-pass filter to your data: you can smooth out the steady state as much as you like, but at the cost of a longer settling time during transients. The best way to determine the sample size is to test different sizes in your application and choose the one that best balances stability against responsiveness.
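
To see this tradeoff concretely, here is a quick sketch in ordinary C++ (meant to run on a PC rather than the Arduino; the step input and the window sizes of 4 and 32 are arbitrary choices for illustration) that feeds a step change through the running-mean update with two different values of N:

#include <cstdio>

// Running-mean update from the article, for a given window size N.
static float update(float total, int &count, int N, float val)
{
  if (count >= N) total = total * (N - 1) / N;
  else            count++;
  return total + val;
}

int main()
{
  const int smallN = 4, largeN = 32;
  float totalSmall = 0, totalLarge = 0;
  int   countSmall = 0, countLarge = 0;

  for (int i = 0; i < 100; i++) {
    float input = (i < 50) ? 10.0f : 20.0f;   // step from 10 to 20 halfway through
    totalSmall = update(totalSmall, countSmall, smallN, input);
    totalLarge = update(totalLarge, countLarge, largeN, input);
    printf("%3d  input=%4.1f  mean(N=4)=%6.2f  mean(N=32)=%6.2f\n",
           i, input, totalSmall / countSmall, totalLarge / countLarge);
  }
  return 0;
}

Printing the two means side by side shows the N=4 average settling near the new level within a dozen samples, while the N=32 average is still climbing at the end of the run: the same stability-versus-responsiveness compromise described above.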

Other Considerations

Another consideration is that pulling in floating point math consumes over 200 bytes of SRAM in overhead. For many chips, particularly ATtiny parts or other micros with small amounts of SRAM, this overhead is too much.

All is not lost. The same library can be implemented using integer arithmetic as well. This last point raises another prospect – can the library be implemented as a C++ template and then instantiated using the data store type of choice?

It could, but applications usually fall into two categories – those where the extra 200 or so bytes of RAM are not a problem, and those where every last byte is crucial. Therefore, two versions of the library are implemented – one using floating point and the other using integer arithmetic.
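
As a rough sketch only (this is not the published integer library, whose details may differ), an integer version mainly swaps the accumulator types, accepts a little truncation in the decay step, and must leave enough headroom in the sum of squares to avoid overflow – a caveat also raised in the comments below:

int  N = 10;              // window size
int  currNumSamples = 0;  // samples gathered so far, capped at N
long total   = 0;         // running sum of values
long squares = 0;         // running sum of squared values; needs headroom to avoid overflow

void addData(int val)
{
  if (currNumSamples >= N) {
    total   = total   * (N - 1) / N;   // integer division truncates a little each update
    squares = squares * (N - 1) / N;
  } else {
    currNumSamples++;
  }
  total   += val;
  squares += (long)val * val;          // cast first so the square does not overflow an int
}

// mean and variance follow the same formulas as before, using integer division:
//   mean     = total / currNumSamples
//   variance = (squares - currNumSamples * mean * mean) / currNumSamples

Scaling down the readings, or keeping the window modest, keeps the squared sum comfortably inside the range of a long on 8-bit parts.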

One last point – in the floating point case, should floats or doubles be used? Both cover the range of values we need, so either will work; doubles offer only extra precision. Since we are working with aggregate statistical quantities, that extra precision buys us little, so simple floats will suffice.

Where to Get It

Both libraries are available on GitHub. Each is implemented entirely in a single header file. These libraries are small, fast, and efficient. In addition to the mean, variance, and standard deviation, the total, min, and max data values can also be retrieved.

Conclusion

I have described a basic statistical library for the Arduino and other microcontrollers. It performs basic statistical calculations and is memory efficient.

Not only can it be used for gathering useful statistical information about sensor data, it can also be used to clean up noisy data such as that which comes from touch sensors. Built around a single class with just one member function for adding data, it is easy and fun to use.

Please offer your suggestions and report any bugs in the comments, and I will try to incorporate improvements into the library. Alternatively, feel free to fork the library on GitHub and issue pull requests for any improvements you have made.

Mathematical expressions are rendered courtesy of QuickLaTeX.com

Statistics on the Arduino (also PIC or any microcontroller) by Provide Your Own is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


14 Comments

  1. Posted January 14, 2013 at 3:32 pm | Permalink

    Very Informative write up….. Appreciated

  2. BBotany
    Posted January 14, 2013 at 5:03 pm | Permalink

    After reading through the integer version, I would describe this as a decent and definitely compact implementation. For well-behaved, char-sized data the mean and variance should work.

    Note that m = m * ((n-1)/n) will not actually discard old data. Instead, it scales it so that the value you expect to store remains approximately bounded by N*mean for the mean, and N*(mean^2) for the sum of squares for the variance. If you put in a window length of two, and try the sequence {10, 10, 10, 20, 20, 20} you will see that you get a final number of 15, not the 2-observation mean of 20. If you used two pairs of accumulators you could implement a window that is constrained between N and 2N, however, by switching accumulators whenever the count reaches N, clearing the “new” accumulator, and always using both together for output. This would give you the long-term memoryless property that you desire.

    The “well behaved” warning is because of large N*value – particularly on readings that can vary widely. Overflow errors would wipe out the accuracy of all subsequent readings, and are reasonably likely to happen if there is respectable variance and auto-correlated errors.

    Why would I notice all this? Why, by burning myself while making a rate adaptive Morse decoder of course.

    • BBotany
      Posted January 14, 2013 at 5:15 pm | Permalink

      Correcting myself on the windowing: you do get exponential decay of the effect of out-of-window values, and the results in my example are not 15… Not 20 either (depending on truncation/rounding algorithm, should be about 18), but close enough for horseshoes and hand grenades.

    • Posted January 17, 2013 at 7:01 pm | Permalink

      You raise some good points. The data is cumulative by design, and discarding the data and starting over would be one way of changing that behavior. It turns out that old data does not really hang around forever though, but is sloughed off over time. In the example you provide, the mean should be 18.75, not 15. As you continue to add more 20s to the accumulated data, the mean will approach 20. By choosing the appropriate window size you can cycle over to new data values as fast or slow as you would like. Put another way, if you want your running average to adapt quickly to new values, use a small sample size. If you want it to adapt more slowly, use a larger sample size. The sample size does not change the storage, but how many of the last samples are accumulated.

      Sample size also governs the impact of bad data. A small sample size will reflect the bad data more readily, but will also clear it out more quickly. A larger sample size keeps the bad data points around longer, but they will have less of an effect.

  3. Daid
    Posted January 15, 2013 at 4:13 am | Permalink

    I do not think the AVR compiler supports doubles and just implements them as floats.

    • Stephen H
      Posted May 15, 2014 at 4:29 pm | Permalink

      The Arduino Reference Section on Doubles states:

      “Double precision floating point number. On the Uno and other ATMEGA based boards, this occupies 4 bytes. That is, the double implementation is exactly the same as the float, with no gain in precision. On the Arduino Due, doubles have 8-byte (64 bit) precision.”

      I had totally missed this before, so thank you for pointing it out.

  4. Posted January 16, 2013 at 11:39 pm | Permalink

    Hey – Just wanted to say thanks for making this library available – I found it through Hackaday and realized it would help me elegantly solve one of the issues with a sound-reactive project I’m working on.

    Here’s a video of one of my experiments that is using your Statistics library to efficiently smooth out the readings from a spectrum analyzer IC to make a nice, flicker-free light show on a series of LEDs: http://youtu.be/t4MN1q9X0Us

    Thanks again!

    Jim

    • Posted January 17, 2013 at 7:20 pm | Permalink

      That is a great application of this library. I have always found the typical color organ somewhat annoying in that it seems to respond mainly to the bass beats. The use of averaging definitely gives it a more pleasing appearance IMHO. I like your project – thanks for sharing it.

  5. Max
    Posted February 2, 2013 at 12:12 pm | Permalink

    Maybe a dumb question, but how can I run this library?
    Should I copy the statistics folder to my libs folder in Arduino? I did that, but it does not work.
    Here is my (your) code:

    #include
    #include

    Statistics stats(10);

    void setup()
    {
      Serial.begin(9600);
    }

    void loop()
    {
      int data = analogRead(A0);
      stats.addData(data);

      Serial.print("Mean: ");
      Serial.print(stats.mean());
      Serial.print(" Std Dev: ");
      Serial.println(stats.stdDeviation());
    }

    • Posted February 14, 2013 at 12:27 am | Permalink

      Your include statements are blank. What kind of error are you getting? You need to provide more detail.

  6. mark C
    Posted August 10, 2013 at 11:37 pm | Permalink

    I renamed README.md to plp.ino

    plp:49: error: ‘Statistics’ does not name a type
    plp.ino: In function ‘void loop()’:
    plp:59: error: ‘stats’ was not declared in this scope

    • Posted August 15, 2013 at 12:27 am | Permalink

      You need to include Statistics.h in your sketch. That includes the library. I have updated the readme and added examples in GitHub. Just do a pull and you’ll be all set.

  7. Stephen H
    Posted May 15, 2014 at 4:15 pm | Permalink

    Nice little library. I really like the simplicity of the mathematics involved. I have many lines of code dedicated to instantiating, sizing, and initializing arrays to hold my smoothing data. This really cuts it down tremendously – not just program size but memory usage as well.

    Watching the 100 sample video that Jim shared, I realized that with this code, I could greatly increase my smoothing window and increase my sampling rate (originally, 10 samples once every 6 seconds) and get a much smoother output from my sensors. I am working on a gardenbot/weather station and the sensitivity of some of the components makes the displayed values jump and change all the time, but with this code I will be able to make it much calmer in its changes.

  8. Posted July 6, 2015 at 8:14 pm | Permalink

    Hi,

    I am the author of WP-QuickLaTeX plugin you are using to render formulas on the website (thank you for using it!).

    Could you please turn on the “Cache images locally” setting in QuickLaTeX->System settings?

    This will speed up your website and reduce the load on our server.

    Thank you.
