Business, government and academic activities almost always require the collection and analysis of data. Numerical data is often represented visually with graphs, histograms and charts. These visualization techniques allow people to gain better insight into problems and devise solutions. Gaps, clusters and outliers are characteristics of data sets that influence mathematical analysis and are readily visible in such representations.
Holes in the Data
Gaps refer to missing areas in a data set. For example, if a scientific experiment collects temperature data in the range of 50 degrees Fahrenheit to 100 degrees Fahrenheit, but nothing between 70 and 80 degrees, that would represent a gap in the data set. A line plot of this data set would have "x" marks for temperatures between 50 and 70 and again between 80 and 100, but there would be nothing between 70 and 80. Researchers can dig deeper and explore why certain data points do not show up in a collected sample.
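As a rough illustration, a gap can also be found without a plot by sorting the values and looking for unusually wide spaces between neighbors. The sketch below is only one way to do this: the function name `find_gaps`, the `min_width` threshold of 8 degrees and the temperature readings are all invented for illustration.

```python
def find_gaps(values, min_width):
    """Return (low, high) pairs of neighboring values that are more than min_width apart."""
    ordered = sorted(set(values))
    return [
        (low, high)
        for low, high in zip(ordered, ordered[1:])
        if high - low > min_width
    ]


# Hypothetical readings in degrees Fahrenheit; nothing falls between 70 and 80.
readings = [50, 55, 58, 62, 66, 70, 80, 84, 89, 93, 100]

print(find_gaps(readings, min_width=8))  # [(70, 80)]
```

What counts as "too wide" is a judgment call; a smaller `min_width` would flag more intervals as gaps.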
Clusters are isolated groups of data points. Line plots, one of the ways to represent data sets, are number lines with "x" marks placed above specific values to show how often each occurs in the data set. A cluster appears as a collection of these "x" marks in a small interval. For example, if the exam scores for a class of 10 students are 74, 75, 80, 72, 74, 75, 76, 86, 88 and 73, seven of the 10 "x" marks on a line plot would fall in the 72-to-76 score interval. This would represent a data cluster. Note that the frequency for 74 and 75 is two, but for all other scores, it is one.
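The exam-score example can be reproduced with a small text-based line plot. This is a minimal sketch using the 10 scores given above; the helper name `line_plot` and the sideways layout (one row per score, rather than marks stacked above a horizontal line) are choices made for illustration.

```python
from collections import Counter


def line_plot(scores):
    """Build one text row per score value, with an 'x' for each occurrence."""
    counts = Counter(scores)
    return [
        f"{value:>3} | {'x' * counts[value]}"
        for value in range(min(scores), max(scores) + 1)
    ]


scores = [74, 75, 80, 72, 74, 75, 76, 86, 88, 73]

print("\n".join(line_plot(scores)))

# Count how many scores fall inside the 72-to-76 cluster.
in_cluster = sum(1 for s in scores if 72 <= s <= 76)
print(in_cluster)  # 7
```

The rows for 74 and 75 each show two marks, and the dense block of rows from 72 to 76 is the cluster.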
At the Extremes
Outliers are extreme values: data points that lie significantly above or below most other values in a data set. What counts as "extreme" depends on the circumstances and on a consensus among the analysts involved in the research. Outliers might be bad data points, also known as noise, or they might contain valuable information about the phenomenon being investigated or about the data collection methodology itself. For example, if class scores are mostly in the 70-to-80 range, but a couple of scores are in the low 50s, those low scores might be outliers.
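Because "extreme" has no single definition, any automated check has to pick a rule. The sketch below assumes one common convention, flagging values more than 1.5 times the interquartile range beyond the first or third quartile; that rule, the function name `iqr_outliers` and the class scores are assumptions made for illustration, not something the discussion above prescribes.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical class scores: mostly in the 70s, with two in the low 50s.
class_scores = [72, 74, 75, 76, 78, 71, 79, 73, 52, 51, 77, 75]

print(iqr_outliers(class_scores))  # [52, 51]
```

A different multiplier `k`, or a different rule altogether, could label the same scores differently, which is why analysts need to agree on the definition first.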
Putting It All Together
Gaps, outliers and clusters in data sets can affect the results of mathematical analysis. Gaps and clusters might point to errors in the data collection methodology. For example, if a telephone survey polls only certain area codes, such as those covering low-income housing complexes or high-end suburban residential areas, rather than a broad cross-section of the population, chances are there will be gaps and clusters in the data. Outliers can skew the mean, or average value, of a data set. For example, the mean of a data set consisting of four numbers -- 50, 55, 65 and 90 -- is 65. Without the outlier 90, however, the mean is about 57.
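The arithmetic in the four-number example can be checked in a few lines. This is a small sketch using Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean

data = [50, 55, 65, 90]

# Drop the outlier 90 and compare the two means.
without_outlier = [v for v in data if v != 90]

print(mean(data))                       # 65
print(round(mean(without_outlier), 1))  # 56.7
```

Removing a single value shifts the mean by more than eight points, which shows how strongly one outlier can pull the average in a small data set.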