wao.ai blog - data labeling and applied machine learning

Dataset vs Ground-Truth Dataset

A dataset is a collection of data samples. The data can be images, audio, text, matrices of numbers or even rows of an Excel spreadsheet. When doing machine learning, you'll usually need a ground-truth dataset. A ground-truth dataset is a regular dataset with annotations added to it. Annotations can be boxes drawn over images, text labels describing each sample, a new column in a spreadsheet or anything else the machine learning algorithm should learn to output.

A couple of quick examples:

  • If you're predicting the risk of someone defaulting on a loan, your dataset may be a spreadsheet containing information about people. Your ground-truth dataset would include a column that indicates whether each person defaulted or not.
  • If you're trying to identify animals in a picture, your dataset might be images of pets. Your ground-truth dataset would be images of pets with bounding boxes showing where the animals are in the image, as well as labels that indicate what animal is in the box.
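
The loan example above can be sketched in a few lines of Python. This is only an illustration: the field names (`income`, `loan_amount`, `defaulted`) and all of the values are made up.

```python
# A plain dataset: rows describing people, with no annotations.
# All field names and values here are hypothetical.
dataset = [
    {"income": 52000, "loan_amount": 10000},
    {"income": 31000, "loan_amount": 25000},
    {"income": 78000, "loan_amount": 5000},
]

# Annotations: whether each person defaulted.
# This is the thing the machine learning algorithm should learn to output.
annotations = [False, True, False]

# The ground-truth dataset is the same data with the annotation
# added as a new "column" on every row.
ground_truth_dataset = [
    {**row, "defaulted": label} for row, label in zip(dataset, annotations)
]

for row in ground_truth_dataset:
    print(row)
```

The original `dataset` is left untouched, so the only difference between the two is the extra `defaulted` field.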

How do I get a dataset?

Many of the datasets companies use come from their own customer data, such as engagement or spending information that could help predict when a promotion should be sent to a user. There are also plenty of datasets available online:

  • The Python library scikit-learn has simple functions for loading toy datasets, which are useful for gaining familiarity with machine learning.
  • The U.S. Government releases a lot of data about Public Safety, Research, Education and more at data.gov.
  • Google introduced Dataset Search, which helps you find freely available datasets.
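
As a quick sketch of the scikit-learn option, the snippet below loads the bundled iris toy dataset. It assumes scikit-learn is installed; the variable names `X` and `y` are just a common convention, not something the library requires.

```python
from sklearn.datasets import load_iris

# Load one of scikit-learn's bundled toy datasets (no download needed).
iris = load_iris()

X = iris.data    # the samples: 150 flowers, 4 measurements each
y = iris.target  # the annotations: a species index for each flower

print(X.shape)            # (150, 4)
print(y.shape)            # (150,)
print(iris.target_names)  # the species names the indices refer to
```

Because this toy dataset ships with a species label for every sample, it is already a ground-truth dataset in the sense described above.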

How do I get a ground-truth dataset?

There are many freely available tools for annotating a dataset to turn it into a ground-truth dataset, such as Universal Data Tool, Label Studio and LabelImg.

You can browse screenshots and interactively search different machine learning tools using Compare Data Tools.

Companies will also annotate your data for you. The biggest challenge when annotating data externally is tracking the quality of your labels. We recommend taking a sample slice of the data returned by the annotation company and measuring the percentage of samples that are correctly annotated.
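
One way to run that spot check is sketched below. The function name `estimate_label_accuracy`, the `review_fn` callback and the example labels are all hypothetical; in practice the trusted labels would come from your own manual review of the sampled items.

```python
import random


def estimate_label_accuracy(vendor_labels, review_fn, sample_size, seed=0):
    """Estimate the fraction of correctly annotated samples from a random slice.

    vendor_labels: dict mapping sample id -> label from the annotation company
    review_fn: callable(sample_id) -> label you trust (e.g. your own review)
    """
    if not vendor_labels:
        raise ValueError("vendor_labels is empty")
    rng = random.Random(seed)
    k = min(sample_size, len(vendor_labels))
    sample_ids = rng.sample(sorted(vendor_labels), k)
    correct = sum(1 for sid in sample_ids if vendor_labels[sid] == review_fn(sid))
    return correct / len(sample_ids)


# Hypothetical data: 10 samples labeled by a vendor.
vendor_labels = {i: "cat" if i % 2 == 0 else "dog" for i in range(10)}

# Hypothetical trusted labels: our own review disagrees on samples 3 and 4.
trusted_labels = dict(vendor_labels)
trusted_labels[3] = "cat"
trusted_labels[4] = "dog"

accuracy = estimate_label_accuracy(vendor_labels, trusted_labels.get, sample_size=10)
print(f"Estimated label accuracy: {accuracy:.0%}")
```

Here the sample covers all 10 items, so the estimate is exact. With a real dataset you would review a much smaller random slice, and the result is only an estimate of the overall quality.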

Training Dataset vs Ground-truth Dataset

"Training dataset" and "ground-truth dataset" are sometimes used interchangeably, but they are slightly different. When building a machine learning model, you split your ground-truth dataset into two smaller datasets: the training dataset and the testing dataset. You then train your machine learning algorithm on the training dataset and test its ability to work on the testing dataset, which it has not seen during training.
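
A minimal sketch of that split, using only the Python standard library, might look like this. The 80/20 ratio, the function name `split_ground_truth` and the example file names are assumptions for illustration, not a recommendation from this article.

```python
import random


def split_ground_truth(samples, test_fraction=0.2, seed=0):
    """Shuffle the ground-truth samples and split them into (training, testing)."""
    shuffled = list(samples)  # copy so the caller's list is not modified
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical ground-truth dataset: (image file, label) pairs.
ground_truth = [
    (f"image_{i}.jpg", "cat" if i % 2 == 0 else "dog") for i in range(10)
]

training_dataset, testing_dataset = split_ground_truth(ground_truth)
print(len(training_dataset), len(testing_dataset))  # 8 2
```

Shuffling before splitting keeps any ordering in the original data (for example, all cats first) from ending up entirely in one of the two datasets.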
