Building datasets

Data scientists often need hundreds of thousands of data points to build, train, and test machine learning models. In some cases, this data is already packaged and ready for consumption. More often, the scientist has to venture out and build a custom dataset, typically by writing a web scraper that collects raw data from various sources of interest and refines it so it can be processed later. These scrapers also need to collect fresh data periodically so that the predictive models can be updated with the most relevant information.
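The collect-and-refine step described above can be sketched in a few lines of Go. This is a minimal illustration under stated assumptions, not a production scraper: the function names `collect` and `refine` are hypothetical, a local test server stands in for a real source so the example runs offline, and a regular expression is used as a crude substitute for a proper HTML parser.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"regexp"
	"strings"
	"time"
)

// tagPattern matches anything that looks like an HTML tag.
// Assumption: a regex is good enough for this sketch; real pages
// usually call for an HTML parser.
var tagPattern = regexp.MustCompile(`<[^>]*>`)

// refine turns raw markup into plain text by replacing tags with
// spaces and collapsing runs of whitespace into single spaces.
func refine(raw string) string {
	noTags := tagPattern.ReplaceAllString(raw, " ")
	return strings.Join(strings.Fields(noTags), " ")
}

// collect fetches one page and returns its refined text.
func collect(client *http.Client, url string) (string, error) {
	resp, err := client.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %s for %s", resp.Status, url)
	}

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return refine(string(body)), nil
}

func main() {
	// Stand-in for a real source of interest: a local server with one tiny page.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<html><body><p>Great   product</p></body></html>")
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 10 * time.Second}
	text, err := collect(client, srv.URL)
	if err != nil {
		fmt.Println("collect failed:", err)
		return
	}
	fmt.Println(text)
}
```

To gather fresh data periodically, the call to `collect` could be placed in a loop driven by a `time.Ticker`; that scheduling is left out here to keep the sketch short.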

A common task for data scientists is determining how people feel about a specific subject, known as sentiment analysis. Through this process, a company could look for discussions surrounding one of its products, or its overall presence, and gauge the general sentiment. To do this, the model must be trained on examples of positive and negative comments, and a well-balanced training set could take thousands of individual comments. A web scraper that collects comments from relevant forums, reviews, and social media sites would help in constructing such a dataset.

These are just a few examples of the web scrapers that drive large businesses such as Google, Mozenda, and Cheapflights.com. There are also companies that will scrape the web for whatever available data you need, for a fee. To run scrapers at such a large scale, you need a language that is fast, scalable, and easy to maintain.