Making Histograms

 

 

What is a Histogram?

A histogram is "a representation of a frequency distribution by means of rectangles whose widths represent class intervals and whose areas are proportional to the corresponding frequencies." (Online Webster's Dictionary)

If that sounds complicated, the concept really is pretty simple. We graph groups of numbers according to how often they appear. Thus if we have the set {1,2,2,3,3,3,3,4,4,5,6}, we can graph them as in the plot to the right.

This graph is pretty easy to make and gives us some useful information about the set. For example, the graph peaks at 3, which is also the median and the mode of the set. The mean of the set is about 3.27, also not far from the peak. The shape of the graph gives us an idea of how the numbers in the set are distributed about the mean; the distribution in this graph is wide compared to the size of the peak, indicating that values in the set are only loosely bunched around the mean.
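To make the counting concrete, here is a minimal Python sketch (Python is our choice here, not something the original page uses) that tallies how often each value appears in the set above and computes the mean, median, and mode with the standard library. The text bars of `#` characters are only a stand-in for the plot.

```python
from collections import Counter
from statistics import mean, median, mode

data = [1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6]

# Count how often each value appears; these counts are the bar heights.
counts = Counter(data)

# Print a simple text histogram, one row per value.
for value in sorted(counts):
    print(f"{value}: {'#' * counts[value]}")

print("mean  :", round(mean(data), 2))
print("median:", median(data))
print("mode  :", mode(data))
```

The tallest row is the one for 3, which matches the median and the mode.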

To learn more about the mean, the median, and the mode, take a look in Statistics How To.

 

 

How is a Real Histogram Made?

The example above is a little too simple. In most real data sets almost all numbers will be unique. Consider the set {3, 11, 12, 19, 22, 23, 24, 25, 27, 29, 35, 36, 37, 45, 49}. A graph that shows how many ones, how many twos, how many threes, etc. would be meaningless, because every bar would have a height of zero or one. Instead, we bin the data into convenient ranges. In this case, with a bin width of 10, we can easily group the data as you see in the table and the graph to the right.

Note that the median is 25 and that there is no mode, since every value appears exactly once; the mean is about 26.5.
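The binning step can be sketched in a few lines of Python. This is an illustration under one assumption: the bins are taken to be 0-9, 10-19, and so on, since the original table is not reproduced here. For this particular set, bins of 1-10, 11-20, etc. would give the same counts.

```python
from collections import Counter
from statistics import mean, median

data = [3, 11, 12, 19, 22, 23, 24, 25, 27, 29, 35, 36, 37, 45, 49]
bin_width = 10

# Integer division maps each value to the lower edge of its bin
# (for example 27 -> 20), assuming bins of the form 0-9, 10-19, ...
bin_counts = Counter((x // bin_width) * bin_width for x in data)

# Print one row per bin, including bins that happen to be empty.
for low in range(0, max(data) + 1, bin_width):
    high = low + bin_width - 1
    print(f"{low:2d}-{high:2d}: {'#' * bin_counts[low]} ({bin_counts[low]})")

print("median:", median(data))
print("mean  :", round(mean(data), 1))
```

Changing `bin_width` is a quick way to see how the choice of bin size changes the shape of the histogram.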

 

 

Let's look at some histograms:
Of course, part of the power of histograms is that they allow us to analyze extremely large data sets by reducing them to a single graph that can show primary, secondary, and tertiary peaks in the data as well as give a visual impression of how significant those peaks are. To get an idea, look at these three histograms:

The first plot represents data with a well-defined peak that is close in value to the median and the mean. While there are outliers, they are of relatively low frequency. In other words, values far from the mean are rare. If this were a mass plot in particle physics, we'd say the mass is understood with good precision.

In the second plot the peak is still fairly close to the median and the mean, but it is much less defined. It is harder to tell from the plot exactly where the peak is. There are almost as many values close to the peak as at the peak itself, and outliers are frequent. As a particle physics mass plot, this gives an imprecise and uncertain mass of a particle.

In the third plot, where are the median and the mean? It is hard to tell, and it may not even be relevant. There are two peaks in this plot: a taller primary peak and a shorter secondary peak. This could indicate either very poor definition of one signal in the data or, more likely, two signals. In particle physics, this could show two separate particles or, as is often the case, a large signal of "background" particles and a smaller signal (sometimes very small), called a "bump," which shows the actual particle under study.
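The three shapes described above can be imitated with made-up numbers. The sketch below is only an illustration: the samples are randomly generated, not real physics data, and the centers (50, 40, 70), spreads, sample sizes, and helper names `bin_counts`, `peak_share`, and `print_histogram` are all hypothetical choices of ours.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical samples: a sharp peak, a broad peak, and a two-peak mix.
narrow = [rng.gauss(50, 3) for _ in range(1000)]
wide = [rng.gauss(50, 12) for _ in range(1000)]
two_peaks = ([rng.gauss(40, 3) for _ in range(700)]
             + [rng.gauss(70, 3) for _ in range(300)])

def bin_counts(values, bin_width=5):
    """Count entries per bin, keyed by the lower edge of each bin."""
    return Counter(int(v // bin_width) * bin_width for v in values)

def peak_share(counts):
    """Fraction of all entries that fall in the tallest bin."""
    return max(counts.values()) / sum(counts.values())

def print_histogram(counts):
    """Print one row per occupied bin; each '#' is about 10 entries."""
    for low in sorted(counts):
        print(f"{low:4d}: {'#' * (counts[low] // 10)}")

for name, sample in [("narrow", narrow), ("wide", wide), ("two peaks", two_peaks)]:
    counts = bin_counts(sample)
    print(f"{name} (share of entries in tallest bin: {peak_share(counts):.2f})")
    print_histogram(counts)
```

A well-defined peak puts a large share of the entries in its tallest bin, while a broad distribution spreads them over many bins. In the two-peak sample, the bins between the peaks are nearly empty, which is what separates two signals from one poorly defined one.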

 

Resources

  • Statistics How To - Basics of statistics for the rest of us.
  • Shodor Histogram Page - This is a nice interactive histogram page in which you can choose different sample histograms and vary the bin size.
  • Histograms: The Basics - This QuarkNet Data Activity enables students to learn more about histograms by doing.