Shell script for standard deviation, arithmetic mean and median

I was looking for a Shell script that could calculate the standard deviation of a data set. I found the one at http://tldp.org/LDP/abs/html/contributed-scripts.html#STDDEV that returns the arithmetic mean as well. I added one more function for calculating the median of the data set. The modified script follows:

#!/bin/bash

# sttdev.sh: Standard Deviation

# Original version obtained from: http://tldp.org/LDP/abs/html/contributed-scripts.html#STDDEV

# 2009-03-09: Panagiotis Kritikakos
# Function stat_median() added for calculating the median value
# of the data set

# ------------------------------------------------------------
#  The Standard Deviation indicates how consistent a set of data is.
#  It shows to what extent the individual data points deviate from the
#+ arithmetic mean, i.e., how much they "bounce around" (or cluster).
#  It is essentially the average deviation-distance of the
#+ data points from the mean.

# =========================================================== #
#    To calculate the Standard Deviation:
#
# 1  Find the arithmetic mean (average) of all the data points.
# 2  Subtract each data point from the arithmetic mean,
#    and square that difference.
# 3  Add all of the individual difference-squares in # 2.
# 4  Divide the sum in # 3 by the number of data points.
#    This is known as the "variance."
# 5  The square root of # 4 gives the Standard Deviation.
# =========================================================== #

count=0         # Number of data points; global.
SC=9            # Scale to be used by bc. Nine decimal places.
E_DATAFILE=90   # Data file error.

# ----------------- Set data file ---------------------
if [ ! -z "$1" ]  # Specify filename as cmd-line arg?
then
  datafile="$1" #  ASCII text file,
else            #+ one (numerical) data point per line!
  datafile=sample.dat
fi              #  See example data file, below.

if [ ! -e "$datafile" ]
then
  echo "\""$datafile"\" does not exist!"
  exit $E_DATAFILE
fi
# -----------------------------------------------------
arith_mean ()
{
  local rt=0         # Running total.
  local am=0         # Arithmetic mean.
  local ct=0         # Number of data points.

  while read value   # Read one data point at a time.
  do
    rt=$(echo "scale=$SC; $rt + $value" | bc)
    (( ct++ ))
  done

  am=$(echo "scale=$SC; $rt / $ct" | bc)

  echo $am; return $ct   # This function "returns" TWO values!
  #  Caution: This little trick will not work if $ct > 255!
  #  To handle a larger number of data points,
  #+ simply comment out the "return $ct" above.
} <"$datafile"   # Feed in data file.

sd ()
{
  mean1=$1  # Arithmetic mean (passed to function).
  n=$2      # How many data points.
  sum2=0    # Sum of squared differences ("variance").
  avg2=0    # Average of $sum2.
  sdev=0    # Standard Deviation.

  while read value   # Read one line at a time.
  do
    diff=$(echo "scale=$SC; $mean1 - $value" | bc)
    # Difference between arith. mean and data point.
    dif2=$(echo "scale=$SC; $diff * $diff" | bc) # Squared.
    sum2=$(echo "scale=$SC; $sum2 + $dif2" | bc) # Sum of squares.
  done

    avg2=$(echo "scale=$SC; $sum2 / $n" | bc)  # Avg. of sum of squares.
    sdev=$(echo "scale=$SC; sqrt($avg2)" | bc) # Square root =
    echo $sdev                                 # Standard Deviation.

} <"$datafile"   # Rewinds data file.

stat_median() {
  NUMS=(`sort -n $1`)
  TOTALNUMS=${#NUMS[*]}
  MOD=$(($TOTALNUMS % 2))
  if [ $MOD -eq 0 ]; then
    ARRAYMIDDLE=$(echo "($TOTALNUMS / 2)-1" | bc)
    ARRAYNEXTMIDDLE=$(($ARRAYMIDDLE + 1))
    MEDIAN=$(echo "scale=$SC; ((${NUMS[$ARRAYMIDDLE]})+(${NUMS[$ARRAYNEXTMIDDLE]})) / 2" | bc)
  elif [ $MOD -eq 1 ]; then
    ARRAYMIDDLE=$(echo "($TOTALNUMS / 2)" | bc)
    MEDIAN=${NUMS[$ARRAYMIDDLE]}
  fi
  echo $MEDIAN
}

# ======================================================= #
mean=$(arith_mean); count=$?   # Two returns from function!
std_dev=$(sd $mean $count)
median=$(stat_median $1)

echo
echo "Number of data points in \""$datafile"\" = $count"
echo "Arithmetic mean (average) = $mean"
echo "Standard Deviation = $std_dev"
echo "Median number (middle) = $median"
echo
# ======================================================= #

exit

#  This script could stand some drastic streamlining,
#+ but not at the cost of reduced legibility, please.

# ++++++++++++++++++++++++++++++++++++++++ #
# A sample data file (sample1.dat):

# 18.35
# 19.0
# 18.88
# 18.91
# 18.64

# $ sh stddev.sh sample1.dat

# Number of data points in "sample1.dat" = 5
# Arithmetic mean (average) = 18.756000000
# Standard Deviation = .235338054
# Median number (middle) = 27.35
# ++++++++++++++++++++++++++++++++++++++++ #

I guess that could be done in Perl with fewer lines and using proper modules for the calculation, but I’m not in love with Perl… yet…

12 thoughts on “Shell script for standard deviation, arithmetic mean and median

  1. Michael Iatrou

    I guess that could be done in Perl with fewer lines and using proper modules for the calculation, but I’m not in love with Perl…

    You should definitely fall in love with Perl, ASAP:

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use Math::NumberCruncher;

    my @data;

    open (IN, “<“, “$ARGV[0]”) or die (“Cannot open $ARGV[0]: $!\n”);
    while () {
    push @data, $1 if (/(\d+)/);
    }
    close (IN);

    printf(“Mean: %f\tMedian: %f\tStdDev: %f\n”,
    Math::NumberCruncher::Mean(\@data),
    Math::NumberCruncher::Median(\@data),
    Math::NumberCruncher::StandardDeviation(\@data));

  2. B

    Or you could do it even shorter with awk on the command line:

    awk ‘{ sum += $1; sumsq += $1*$1 } END { printf “Mean: %f, Std: %f\n”, sum/NR, sqrt(sumsq/NR – (sum/NR)^2) }’ filename

    or if you only want certain lines in a file (ie simulation output):
    grep ‘Thrust:’ filename | awk …

    or the last N lines (ie steady-state):
    grep ‘Thrust:’ filename | tail -N | awk…

  3. Highwind

    @ B:

    with the problem that the Std is not correct in your command.
    It should be (with M the mean):
    Std = sqrt( (1/N) * sum (x_i – M)^2 )
    you compute:
    wrong = sqrt( ((1/N) * sum (x_i^2)) – M^2 )

    You need the mean inside the loop already…

  4. Pingback: The IT Detecive Agency: FTPs and other things going slow to one part of one location | Dr John's Tech Talk

  5. A bit off topic but perhaps will help someone.
    I had to chance the following in the perl code

    open (IN, “<”, “$ARGV[0]“) or die (“Cannot open $ARGV[0]: $!\n”);
    while () {
    push @data, $1 if (/(\d+)/);
    }

    to
    open (IN, “<”, “$ARGV[0]“) or die (“Cannot open $ARGV[0]: $!\n”);
    @data=;

    in order to make the code work with decimal number.

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s