Grouping data points within a scatter plot

A basic scatter plot has a set of points plotted at the intersection of their values along x and y axes. Sometimes, we might wish to further distinguish between these points based on another value associated with the points. In this recipe, we will learn how we can group data points using colors.

Getting ready

To try out this recipe, start R and type the commands at the command prompt. You can also save the recipe as a script so that you can use it again later on.

We will also need the lattice and ggplot2 packages. The lattice package ships with the standard R distribution as a recommended package, so it only has to be loaded, but the ggplot2 package must be installed first. To do this, run the following command at the R prompt:

install.packages("ggplot2")

How to do it...

As a first example, let's use the xyplot() function from the lattice package:

library(lattice)

xyplot(mpg ~ disp,
       data = mtcars,
       groups = cyl,
       auto.key = list(corner = c(1, 1)))

How it works...

In this example, we used the xyplot() function to plot mpg (on the y axis) against disp (on the x axis) from the preloaded mtcars dataset. This is easier to follow if we look at the actual data. Typing mtcars at the R prompt and pressing Enter prints the whole dataset; here we print only the first six rows and first three columns, along with the row names:

mtcars[1:6,1:3] 
                       mpg   cyl   disp
Mazda RX4             21.0     6    160
Mazda RX4 Wag         21.0     6    160
Datsun 710            22.8     4    108
Hornet 4 Drive        21.4     6    258
Hornet Sportabout     18.7     8    360
Valiant               18.1     6    225
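Since the grouping depends on the distinct values of cyl, it can help to check them before plotting. The following is a small sketch using only base R; the variable names cyl_values and cyl_counts are our own and are not part of the recipe. In mtcars, cyl takes just three values (4, 6, and 8), which is why the plot ends up with three colors.

```r
# mtcars is provided by the datasets package, which R attaches by default.
cyl_values <- sort(unique(mtcars$cyl))   # distinct cylinder counts
cyl_counts <- table(mtcars$cyl)          # number of cars for each cylinder count

print(cyl_values)   # 4 6 8
print(cyl_counts)   # 11 cars with 4 cylinders, 7 with 6, 14 with 8
```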

So, we plotted mpg against disp, but we also used the groups argument to group the data points by cyl. This tells xyplot() to draw the data points in a different color for each number of cylinders (cyl). Finally, the auto.key argument adds a legend so that we know which value of cyl each color represents. The auto.key argument can take a list of settings. The only one we have provided here is the legend position, given by the corner component, which we set to c(1,1) to place the legend in the top-right corner inside the plotting area. We can also simply set auto.key to TRUE, which draws the legend in the top margin outside the plotting area.
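As a variation on the legend settings described above, the sketch below moves the legend to the right margin and gives it a title. The data frame name cars, the factor labels, and the axis labels are our own choices for illustration; space and title are standard legend components that auto.key passes on to the lattice key.

```r
library(lattice)

# Convert cyl to a factor with readable labels (the labels are our own choice).
cars <- transform(mtcars,
                  cyl = factor(cyl, labels = c("4 cyl", "6 cyl", "8 cyl")))

# Same plot as before, but with the legend in the right margin and a title.
p <- xyplot(mpg ~ disp,
            data = cars,
            groups = cyl,
            auto.key = list(space = "right", title = "Cylinders"),
            xlab = "Displacement (cu. in.)",
            ylab = "Miles per gallon")

# A lattice plot is an object; print() draws it (needed inside scripts and functions).
print(p)
```

Storing the plot in p before printing is optional at the prompt, where lattice plots are drawn automatically, but it makes the example behave the same when saved as a script.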

There's more...

The xyplot() function has slightly obscure arguments. If you look at the help file on xyplot() (by running ?xyplot), you will see that there are a lot of arguments that can be used to control many different aspects of the graph. A simpler alternative to xyplot() is using the functions from the ggplot2 package. Let's draw the same plot using ggplot2:

library(ggplot2)
qplot(disp, mpg, data = mtcars, col = as.factor(cyl))

First, we load the ggplot2 package and then we use the qplot() function to create the graph. We passed disp and mpg as the x and y variables, respectively (note that we can't use the y~x notation in qplot()). To group by cyl, all we had to do was set the col argument to cyl, wrapped in as.factor() for the reason explained in the next paragraph. This tells qplot() that we want to group the points based on the values of cyl and represent them by different colors. The legend is automatically drawn to the right.

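Note that we set col to as.factor(cyl) and not just cyl. This makes sure that cyl is treated as a factor (that is, a categorical variable), so each distinct value gets its own color and legend entry. If we just use cyl, the points are plotted in the same positions, but ggplot2 treats cyl as a numerical variable: the points are colored along a continuous gradient, and the legend becomes a color bar covering the whole range from 4 to 8, including values that do not occur in the data.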
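The difference between a factor and a numeric grouping variable can be seen by building both versions side by side. The sketch below uses the ggplot() and geom_point() interface instead of qplot(), since qplot() is marked as deprecated in recent ggplot2 releases; the object names p_factor and p_numeric and the legend title are our own choices.

```r
library(ggplot2)

# Discrete colors: cyl is converted to a factor inside aes(),
# so the legend shows one entry per cylinder count.
p_factor <- ggplot(mtcars, aes(x = disp, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(colour = "Cylinders")

# Continuous colors: cyl is left numeric,
# so the legend becomes a color bar spanning 4 to 8.
p_numeric <- ggplot(mtcars, aes(x = disp, y = mpg, colour = cyl)) +
  geom_point()

print(p_factor)
print(p_numeric)
```

Both objects describe the same points; only the color scale and the legend differ.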

Thus, it is easier and more intuitive to produce a better-looking graph with ggplot2.

See also

We will use ggplot2 to group data points by size and symbol instead of color in the next recipe.