Starting with TensorFlow Datasets, Part 2: Intro to tfds and Its Methods
After discussing the tf.data pipeline APIs, in this article I want to talk about the TensorFlow Datasets library. If you are new to the tf.data API, I highly encourage you to look at Part 1 of the series, which introduces tf.data; it is a prerequisite for understanding this library because the TensorFlow Datasets library returns data as a tf.data.Dataset.
Table of Contents:
- Brief Overview of the TensorFlow Datasets Library.
- Intro to the TensorFlow Datasets Library.
- A Few Important Methods of the TensorFlow Datasets Library.
- Visualization Methods of the TensorFlow Datasets Library.
- Multiple Ways to Split Your Data into Train, Test, and Validation Sets.
1. Brief Overview of the TensorFlow Datasets Library
From the official docs:
TensorFlow Datasets is a collection of datasets ready to use, with TensorFlow or other Python ML frameworks, such as Jax. All datasets are exposed as tf.data.Datasets, enabling easy-to-use and high-performance input pipelines.
The library consists of datasets spread across various Machine Learning and Deep Learning tasks.
1. Computer Vision datasets for tasks like image classification, object detection, video classification, etc.
2. Natural Language Processing datasets for tasks like sentiment analysis, text classification, text summarization, question answering, etc.
3. The library also has datasets related to Unsupervised Learning, Reinforcement Learning, graph datasets, and many more.
The website lists the plethora of datasets available with the library. I highly encourage you to look through it; personally, I found a lot of exciting datasets, which has spawned quite a few new projects for me.
2. Intro to TensorFlow Datasets Library
Before we begin, we need to install the TensorFlow Datasets library (it comes preinstalled on Google Colab). Please run the following to install the tfds library on your machine:
pip install -q tensorflow-datasets
Once the library has been installed, we can download and load a dataset and start working with it.
The easiest way to download and load a dataset for your experiment is the tfds.load() method. The first argument to the load function is the name of the dataset you want to download; since we are using the MNIST data here, we pass the string ‘mnist’. The catalogue lists an exhaustive list of the datasets available.
The above code does the following:
- It downloads the mnist dataset (as a tfrecord)
- Loads the data in the form of tf.data.Dataset (from the downloaded tfrecord)
Now you can manipulate the loaded dataset and build your pipeline around it. One of the most important reasons I use tfds is that it exposes every dataset as a tf.data.Dataset, which eliminates a lot of loading and transforming work for me.
3. A few important methods of TensorFlow Datasets Library
Since tfds is a library that helps us load datasets into memory, we will be dealing with tfds.load() a lot, so let us take a detailed look at the various arguments it supports.
Arguments of tfds.load()
- split=: which split to read (e.g. 'train', ['train', 'test'], 'train[80%:]', ...)
- shuffle_files=: controls whether to shuffle the files between each epoch.
- data_dir=: location where the dataset is saved (defaults to ~/tensorflow_datasets/).
- with_info=True: returns the tfds.core.DatasetInfo containing dataset metadata.
- download=False: disables the download.
- as_supervised=True: returns a tuple (features, label) instead of a dict, for supervised datasets.
a. Load with with_info=True
Let’s check what info we get when we set with_info=True when loading our dataset. It returns a tfds.core.DatasetInfo object with details about the dataset, such as its features, splits, and citation.
b. Load with as_supervised=True
4. Visualization methods of the TensorFlow Datasets library:
TensorFlow Datasets comes with a couple of functions that let you visualize datasets. There are two such methods: tfds.as_dataframe and tfds.show_examples. Let’s take a closer look at these two methods:
1. tfds.as_dataframe: helps in visualizing images, audio, text, videos, etc. One thing to note is that the method needs the DatasetInfo object returned when with_info=True. In the following example we will load the MNIST dataset, use the take() method of tf.data.Dataset to pick 4 examples from the loaded dataset, and visualize them.
2. tfds.show_examples: this method plots a few samples from the dataset. Currently it only supports image datasets.
5. Multiple ways to split your data into train, test, and validation sets
The split argument offers multiple ways to split your data, and in this section I want to demonstrate those methods with code. This can be quite handy when you want to run experiments with different sizes for train, validation, and test.
- With no split argument, load returns a dictionary; use it to split your data
- load returns the train and test data separately
- load returns a dictionary based on the splits defined
To learn more about slicing and splitting, refer to the official docs.
Link to all the code present can be found below :
Link to the github: https://github.com/Virajdatt/tensorflow_dataset_intro
Link to the colab: https://colab.research.google.com/drive/17Hrv6t2j39xxkLwJhQvjNB1I0JF0FAdw?usp=sharing#scrollTo=CegK_RO-KMgs
That’s it for this week; I hope you had a good read. Next week we will look at an end-to-end deep learning example: how to load a dataset from tfds, do EDA, and build and test a deep learning model with TensorFlow and Keras. We will see an example of image classification and also one of text classification.
Please clap if the content was helpful to you. You can reach me on the following platforms if you have any questions (or mention your questions in the comments below).
Linkedin:- https://www.linkedin.com/in/virajdatt-kohir/
Twitter:- https://twitter.com/kvirajdatt
GitHub:- https://github.com/Virajdatt
GoodReads:- https://www.goodreads.com/user/show/114768501-virajdatt-kohir