Uncertainty, Statistics, and Measurement

Introduction: It’s difficult, although not impossible, to do science completely without numbers. Eventually, measurement is required to verify theory, and it is impossible to do measurement without numbers. This lab is about numbers: how to get them, how to use them, and when to trust them. Use it as a resource for data analysis in future labs.

Theory: We’ll break this down into three topics: measurement and uncertainty, plotting and curve fitting, and statistics.

Measurement and Uncertainty: No measuring instrument is perfect; any measuring tool has two limitations: accuracy and precision.

Accuracy is how closely a tool agrees with defined standards. For instance, if an odometer says this distance: |---------------| is a mile, well, it obviously isn’t. While it isn’t necessary for us to confirm that every instrument provided by the school conforms to the National Bureau of Standards, it’s something you should think about when you are handed equipment in your future career.

Precision is a measure of the finest gradation on the instrument’s measuring scale. If a meter stick is marked only in meters, it is not as precise as one marked in centimeters, which in turn is not as precise as one marked in millimeters. Obviously, the more precise the tool, the better the measurement, down to dimensions where quantum uncertainty is a factor. However, the finest gradation also sets an inherent limit: your measurement can be no better than one-half the smallest gradation. This is a fundamental truth about measurement, and this course requires an explicit statement of uncertainty for every result.

How to do this? As a simple example, let’s say you are determining the area of a table top. You measure with a meterstick marked down to the millimeter and find the table top is 89.4 cm by 159.8 cm. Ignoring significant figures for the moment, you would obtain 14,286.12 cm². The worst case (and you are always looking for the worst case) would be if each measurement were 0.5 mm off. The table might actually be 89.45 cm by 159.85 cm, for an area of 14,298.5825 cm². The uncertainty of your calculation would then be 12.4625 cm², or about 0.09%. (N.B. this has nothing to do with percent difference or percent error.) With such a small uncertainty you should be very confident in your determination; notice, though, that this detail is lost once the result is rounded to the proper number of significant figures (three, limited by the three digits in 89.4 cm).
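If you want to check this arithmetic outside the spreadsheet, here is a minimal sketch in Python (the language and variable names are my own choice for illustration; the lab itself uses Excel):

    # Worst-case uncertainty for the table-top area.
    length = 159.8     # cm, measured with a millimeter-ruled meterstick
    width = 89.4       # cm
    half_grad = 0.05   # cm, one-half of the smallest (1 mm) gradation

    area = length * width                                     # 14286.12 cm^2
    area_worst = (length + half_grad) * (width + half_grad)   # 14298.5825 cm^2
    uncertainty = area_worst - area                           # about 12.46 cm^2
    percent = 100 * uncertainty / area                        # about 0.087 %
    print(f"Area = {area:.2f} cm^2 +/- {uncertainty:.2f} cm^2 ({percent:.3f}%)")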


But suppose your meterstick were marked only in meters; you could at best say the table is 100 cm by 200 cm, with an uncertainty in each measurement of 50 cm. Assume the worst, that each measurement is low by 50 cm. Your area would be 20,000 cm² with an uncertainty of 17,500 cm²! (Be sure you can replicate this uncertainty mathematically.) If you are thinking, “heck, I could obviously see that the table isn’t 1.5 m by 2.5 m,” true; but suppose you are measuring the wavelength of light, down in the 500 nm range – you won’t be able to “see” the obvious down there.

We are illustrating the principle; don’t get lost in the weeds here. For every result you provide this semester, you must qualify it with an uncertainty, either +/- a value or a percentage. You can either increase or decrease the measured values by the measurement uncertainty, whichever produces the worst-case scenario. Here’s another example: suppose you are measuring a disk to try to determine pi. With the “precise” meterstick above, you measure the diameter of the disk to be 7.3 cm and the circumference (by carefully rolling the disk along the meterstick) to be 22.9 cm. Pi would be 3.13698630137 (again, silly with so many digits). With measurement uncertainty, you’d use 22.95 cm and 7.25 cm. Why? Because you want to maximize the result to find the worst case. The uncertainty in your result would then be 0.0285309400094 (when will the over-digititis end?).
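A similar Python sketch for the pi example, just to verify the numbers (note that for a ratio the worst case pairs the largest numerator with the smallest denominator):

    # Worst-case uncertainty for pi measured by rolling a disk.
    circumference = 22.9   # cm
    diameter = 7.3         # cm
    half_grad = 0.05       # cm

    pi_measured = circumference / diameter                           # about 3.1370
    pi_worst = (circumference + half_grad) / (diameter - half_grad)  # about 3.1655
    uncertainty = pi_worst - pi_measured                             # about 0.0285
    print(f"pi = {pi_measured:.4f} +/- {uncertainty:.4f}")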

Another way to do this is with differentials using this formula:

Equation 1
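For reference, and assuming Equation 1 is the standard first-order (differential) propagation formula, for a result f that depends on measured quantities x and y it reads roughly:

    \delta f \approx \left| \frac{\partial f}{\partial x} \right| \delta x + \left| \frac{\partial f}{\partial y} \right| \delta y

For the table-top area A = xy this gives \delta A \approx y\,\delta x + x\,\delta y = (159.8)(0.05) + (89.4)(0.05) \approx 12.5 cm², matching the worst-case estimate above to first order.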


This may or may not make sense, depending on what level of calculus you’ve achieved. The resulting uncertainty, while perhaps slightly different in value, is nonetheless valid. I'll be glad to show you how this works, but we'll reserve its official induction into the lab for 202/203.


Zeroing is another thing to watch. If a caliper is closed yet still reads a non-zero number, every measurement will be off by that amount. Be sure that when something displays zero it really is zero.


Plotting and Curve Fitting: First off, let’s get this out of the way: Excel will happily provide a trendline for your graph (what Excel calls a “chart”). Whether that trendline is a good representation of your data is another matter.

Let us consider the appropriateness of a particular best-fit curve, starting with the following set of data:

x     y
1     1.123501
2     2.178959
3     3.214834
4     4.359026
5     5.574478
6     6.448291
7     7.554502
8     8.073998
9     9.877577
10    10.0144

Table 1

It would graph out like this:

Graph 1

(N.B.: this graph is woefully under-labeled.) Anyone with the slightest mathematical skill would say this looks like a linear function; one could easily draw a line that spears through most of the points. Yet Excel would gladly fit a sixth-order polynomial to it:

Graph 2

This curve manages to touch all of the points, but is it appropriate? Is it better than a straight line just because it hits all the data? How good a predictor of future values would a sixth-order polynomial be?
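If you would like to see the difference numerically, here is a sketch (assuming Python with NumPy, which is not part of the Excel workflow this lab uses) that fits both a straight line and a sixth-order polynomial to the Table 1 data:

    import numpy as np

    # Table 1 data: y is roughly linear in x with a little scatter.
    x = np.arange(1, 11)
    y = np.array([1.123501, 2.178959, 3.214834, 4.359026, 5.574478,
                  6.448291, 7.554502, 8.073998, 9.877577, 10.0144])

    line = np.polyfit(x, y, 1)     # slope and intercept of a straight-line fit
    poly6 = np.polyfit(x, y, 6)    # a sixth-order polynomial will hug every point

    # The polynomial leaves smaller residuals on *these* ten points...
    print(np.abs(np.polyval(line, x) - y).max())
    print(np.abs(np.polyval(poly6, x) - y).max())
    # ...but extrapolate both fits to x = 12 and the polynomial typically
    # wanders far off the linear trend.
    print(np.polyval(line, 12), np.polyval(poly6, 12))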

Consider this data set:

x     y
1     1.56279
2     5.480603
3     3.701223
4     3.853476
5     14.46292
6     11.41096
7     456.5775
8     757.9717
9     135.8949
10    4.290305

Table 2

And its plot:

Graph 3

What the heck do you do with this? Yes, you could thread a line through the cloud, but you wouldn’t hit many of the data points. And what about those values at x = 9 and 10? Excel would gladly come up with a sixth-order polynomial to hit as many points as possible, but would it mean anything? The numbers actually come from an exponential function, not a polynomial.

And what about this:

Graph 4

There seems to be some linearity, but look how much farther each y-value lies from the trendline as x increases.

The moral of the story: don’t implicitly trust that any trendline Excel comes up with is a good representation of the data. You need to bring some expertise to the problem.

We will examine a little later how much trust we can place in the data and in the trendline based on it. For now, consider this actual data from the Hong Kong Environmental Protection Department:

Figure 1

They’ve managed to get six orders of magnitude on one attractive graphic. How’d they do that?

Examine this data: distances in the Milky Way:

Sun to:             miles
Earth               93,000,000
Saturn              893,750,000
Heliopause          16,000,000,000
Oort Cloud          697,500,000,000
Alpha Centauri      25,284,000,000,000
Betelgeuse          2,510,760,000,000,000
Eta Carinae         44,100,000,000,000,000
Galactic Nucleus    152,880,000,000,000,000

Table 3

Any scale fine enough to show 93E6 miles will stretch a great distance to display 1.52E17 miles.


The answer for both tables, as well as for Graph 3, is to use logarithmic scaling. Look at the data from Table 2 when the natural log of each y-value is taken:

Graph 5

Spearing a line through this data will be much more pleasing and informative.
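As a sketch of the same idea (Python/NumPy assumed; in the lab you will do this in Excel), transforming the Table 2 y-values before fitting:

    import numpy as np

    # Table 2 data: y varies over several orders of magnitude.
    x = np.arange(1, 11)
    y = np.array([1.56279, 5.480603, 3.701223, 3.853476, 14.46292,
                  11.41096, 456.5775, 757.9717, 135.8949, 4.290305])

    ln_y = np.log(y)                           # natural log compresses the spread
    slope, intercept = np.polyfit(x, ln_y, 1)  # straight-line fit in (x, ln y) space
    print(slope, intercept)
    # If y = A * exp(b*x), then ln y = ln A + b*x, so the slope estimates b
    # and the intercept estimates ln A.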


Logarithms were invented in the early 17th century by John Napier in part to deal with large numbers such as those encountered by Johannes Kepler when he was determining distances in the Solar System. You can choose to either make a formula to take the natural log of values or format the y-axis of your chart to plot logarithmically. The temptation will be to do the latter, but unless you have an integer base for your log your scaling might look weird.

Two final things about plotting: sometimes you are concerned with only a small domain of the data, say, the behavior of a system around some important value (think of water around 273 K). A trendline that fits this domain but fails at much higher and lower values may be perfectly fine for interpolation but not for extrapolation. A good example of this is the 18th-century Bode-Titius law of orbital radii.

And a word about signal and noise: signal is what you want to examine; noise is the unwanted data mixed in with it:

Graph 6

How you get rid of this noise is beyond the scope of the course, but nonetheless you should recognize it when you see it.

Then there is bias, or offset: a value added to or subtracted from every datum:

Graph 7

This is easily fixed: subtract 1 from every y-value, IF you are sure the bias is inappropriate; there’s always the danger of eliminating signal along with the noise.


Statistics: This section deals with ways to assess your data for trustworthiness. Statistics is several courses in itself. However, using some basic principles and formulae on the computer can tell you a lot about how well your lab ran. Not all principles are useful in all cases: their relevance depends on the experiment.


MEAN: Also called the average or arithmetic mean. Simply (sum of x)/N, where the x’s are the data and N is the number of data points. For Excel and similar spreadsheets, use the function AVERAGE(mm:nn), where mm and nn are the beginning and ending cells of the range. (The mean is written as <x>, or sometimes as x with a bar over it.)


RMS: As the name implies, RMS (root mean square) is the square root of the average of the square of each data point. The order is important: it is NOT the square root of the square of the average of the data! It looks like this:

Equation 2
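In symbols (assuming Equation 2 takes the standard form matching the description below), for N data values x_1, ..., x_N:

    x_{\mathrm{rms}} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} x_i^{2} }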


When data vary between positive and negative values, their arithmetic mean may be zero even though the values themselves are clearly not negligible. For instance, the average of a sine function over one period is zero, yet a typical wall socket certainly delivers a usable voltage; the voltage quoted for an outlet (120 V in North America) is an RMS value. RMS is a way of incorporating negative values in the data realistically, and it is used a lot in physics. Excel does not have a built-in RMS formula, so you’ll have to apply Equation 2 where appropriate.
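Here is a small illustration (a Python/NumPy sketch; the procedure below has you do the equivalent in Excel):

    import numpy as np

    # Mean versus RMS of one period of a unit-amplitude sine wave.
    t = np.linspace(0.0, 2.0 * np.pi, 128, endpoint=False)
    v = np.sin(t)

    mean = v.mean()                     # essentially zero
    rms = np.sqrt(np.mean(v ** 2))      # about 0.707, i.e. 1/sqrt(2)
    print(mean, rms)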

STANDARD DEVIATION: An important concept. It is essentially the RMS of the deviations from the mean or from a known value. You use it when your data points cluster around a given value or average; it tells you, essentially, the RMS error. A low standard deviation says your values are close to the average. Moreover, a statistician will talk about being within one or two standard deviations of the mean; this is an indication of statistical significance. For Excel, use the function STDEV(mm:nn), which finds s using the average of the range (mm:nn). If your data have a normal distribution, 68.27% of all your data lie between <x> - s and <x> + s; that is one standard deviation. (A normal distribution is a fair indicator of consistent results.) Furthermore, 95.45% of your data lie within 2 standard deviations, and 99.73% lie within 3 standard deviations. If your data don’t obey these rules, something may be wrong. The graph that follows is for the population, not the sample, standard deviation, but the concept is the same:

Graph 8

Unfortunately, what may be wrong is an insufficient number of data points. Statistics with fewer than 30 data points (N < 30) is chancy at best. This should hint at a rule: get lots of data! For more information on normal distributions and statistical significance, research z-scores, levels of significance, and one- and two-tailed tests. Incidentally, this is how I calculate your grade ranges. Lastly, a quantity known as the variance is closely related to the standard deviation: it is simply s². It is used for very large data sets.
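A quick way to see the 68-95-99.7 rule in action (a Python/NumPy sketch on simulated data; not something the lab requires):

    import numpy as np

    # Fraction of normally distributed data within 1, 2, and 3 standard deviations.
    rng = np.random.default_rng(0)
    data = rng.normal(loc=10.0, scale=2.0, size=10_000)

    mean = data.mean()
    s = data.std(ddof=1)     # sample standard deviation, like Excel's STDEV

    for k in (1, 2, 3):
        frac = np.mean(np.abs(data - mean) < k * s)
        print(f"within {k} standard deviation(s): {100 * frac:.2f}%")
    # Expect roughly 68.27%, 95.45%, and 99.73% for a normal distribution.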


CORRELATION: How good is the fit? How closely does y depend on x? To find out we calculate r, the linear correlation coefficient, which lies between -1 and 1. One is perfect correlation, zero is no correlation, and negative one is perfect anti-correlation, i.e., the reverse of the relation you expected.

Graph 9

The result of the CORREL function is -0.449895899. On the chart you see R², also known as the coefficient of determination, which is used for predicting future values. If you need to make a prediction, use R²; otherwise the unadorned R is preferable because it also conveys the direction of the dependence.
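For completeness, both quantities can be computed outside Excel; here is a sketch in Python/NumPy using the Table 1 data (which, unlike the Graph 9 data, is strongly correlated):

    import numpy as np

    # Linear correlation coefficient r and coefficient of determination r^2.
    x = np.arange(1, 11)
    y = np.array([1.123501, 2.178959, 3.214834, 4.359026, 5.574478,
                  6.448291, 7.554502, 8.073998, 9.877577, 10.0144])

    r = np.corrcoef(x, y)[0, 1]   # the same quantity Excel's CORREL returns
    print(r, r ** 2)              # r is close to +1 here; r^2 is what the chart shows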

Task:

Equipment:

Procedure:

  1. Measure the block with the micrometer and find its volume: include uncertainty.
  2. Measure the block again, this time with the measuring stick: include uncertainty. Comment on the two results with the different measuring tools.
  3. Plot the Hong Kong Environmental Protection Department data with dB (green values) on the x-axis and the pressure (blue values) on the y-axis. For this graph use calculated logarithmic values.
  4. Make a bar chart of the Milky Way Distances data; format the x-axis to use logarithmic scaling.
  5. Use the data in the second table to reproduce the third graph. Next, take the natural log of these data and plot that in a new chart; include a linear trendline and display the equation and correlation on the chart. Finally make a third chart, using the original data but format the y-axis to use natural logarithmic scaling. Comment on Excel’s choice of y-axis gradations for this third graph.
  6. Generate a sine wave using the sine function in Excel. Use 128 x-values; what you enter will be in radians, so take this into account. The period of the function is up to you, but make the amplitude 1. Find the mean of this function over your 128 range values. Then find the RMS value of this function (square the function, find the average of these values and then take the root).
  7. Make a new sine wave from your sine values but give it a bias of -1.
  8. Does this data display a normal distribution?
    1. First, find the average and the standard deviation of the data set.
    2. Make a column of these values: MEAN - 2*STDEV, MEAN - 1*STDEV, MEAN, MEAN + 1*STDEV, MEAN + 2*STDEV. This is your BIN column.
    3. Select the five cells next to the BIN column you just made.
    4. Go to Formulas / More Functions / Statistical / FREQUENCY and follow the prompts.
    5. IMPORTANT! When the bin and data are filled in, DON'T press the OK button; use <CTRL>,left<SHIFT>,<ENTER>.
      1. Notice: left<SHIFT>.
      2. This is what you need to do to enter an array, not just a single value.
    6. Now plot the BINS as X and FREQUENCY as Y. Use an area or column style chart.
  9. Try rejecting all values below 25%; does the data display a normal distribution now?


Questions:

  1. If you allowed Excel to give you a trendline for the sine function you plotted, comment on how useful that would have been.
  2. Comment on Excel’s choice of y-axis gradations for your third graph from the second table’s data.
  3. Find three examples of graphs using a logarithmic x-axis on the internet and include them in your spreadsheet with proper attribution.