Collecting the data

We will use the match history of the NBA's 2015-2016 season. The website http://basketball-reference.com hosts a large collection of resources and statistics from the NBA and other leagues. To download the dataset, perform the following steps:

  1. Navigate to http://www.basketball-reference.com/leagues/NBA_2016_games.html in your web browser.
  2. Click Share & more.
  3. Click Get table as CSV (for Excel).
  4. Copy the data, including the heading, into a text file named basketball.csv.
  5. Repeat this process for each of the other months, appending the data to the same file but leaving out the heading.

This will give you a CSV file containing the result of every game of that NBA season. Your file should contain 1316 games, for a total of 1317 lines including the header line.
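As a quick sanity check on the assembled file, the line count can be verified with a few lines of Python. This is only a minimal sketch: the helper name count_lines is hypothetical, and the code assumes basketball.csv sits in the current working directory.

```python
from pathlib import Path


def count_lines(path):
    """Return the number of non-empty lines in a text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


csv_path = Path("basketball.csv")  # assumed location: current working directory
if csv_path.exists():
    n_lines = count_lines(csv_path)
    # One header line plus one line per game.
    print(f"{n_lines} lines, {n_lines - 1} games")
else:
    print(f"{csv_path} not found; complete the download steps first")
```

If the count is higher than expected, a likely cause is a heading row that was copied in again along with one of the later months.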

CSV files are text files in which each line is one row and the values within a row are separated by commas (hence the name, comma-separated values). CSV files can be created manually by typing into a text editor and saving with a .csv extension. They can be opened in any program that reads text files, and also in Excel as a spreadsheet. Excel (and other spreadsheet programs) can usually export a spreadsheet to CSV as well.
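To make the format concrete, here is a minimal sketch that parses a small CSV snippet with Python's standard-library csv module. The column names and the two data rows are made up for illustration and are not taken from the real dataset.

```python
import csv
import io

# Hypothetical sample: a header line followed by two rows (not real game data).
sample = (
    "Date,Visitor,VisitorPts,Home,HomePts\n"
    "2015-10-27,Team A,94,Team B,106\n"
    "2015-10-27,Team C,95,Team D,97\n"
)

reader = csv.reader(io.StringIO(sample))
header = next(reader)  # the first line holds the column names
rows = list(reader)    # each remaining line becomes a list of strings

print(header)
for row in rows:
    print(row)
```

Note that every value comes back as a string; converting scores to numbers is left to the caller.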

We will load the file with the pandas library, which is an incredibly useful library for manipulating data. Python also has a built-in module called csv that supports reading and writing CSV files. However, pandas provides more powerful functions, which we will use later in the chapter to create new features.
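As a rough sketch of what loading looks like, the following reads CSV text into a pandas DataFrame with read_csv. To stay self-contained, it parses a small in-memory sample with made-up column names and values; for the real dataset you would pass the path to basketball.csv instead.

```python
import io

import pandas as pd

# Hypothetical sample standing in for basketball.csv (not real game data).
sample = (
    "Date,Visitor,VisitorPts,Home,HomePts\n"
    "2015-10-27,Team A,94,Team B,106\n"
    "2015-10-27,Team C,95,Team D,97\n"
)

# read_csv accepts a file path or a file-like object.
df = pd.read_csv(io.StringIO(sample))

# For the real file: df = pd.read_csv("basketball.csv")
print(df.head())
print(df.dtypes)
```

Unlike the csv module, pandas infers column types, so the score columns in this sample are read as integers rather than strings.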

For this chapter, you will need to install pandas. The easiest way to install it is to use Anaconda's conda installer, as you did in Chapter 1, Getting Started with Data Mining, to install scikit-learn:
$ conda install pandas
If you have difficulty installing pandas, head to the project's website at http://pandas.pydata.org/getpandas.html and read the installation instructions for your system.
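To confirm that the installation worked, you can check from Python whether the module can be found. The snippet below is a minimal sketch; the helper name is_installed is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("pandas"):
    import pandas
    print("pandas version:", pandas.__version__)
else:
    print("pandas is not installed in this environment")
```

Run it with the same Python interpreter that conda manages; a different interpreter may not see the package.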