Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Lab 1: Getting to Know Kaggle and Google Sheets



Lab 1: Getting to Know Kaggle and Google Sheets

Welcome to the first lab for Data Science Focused for Social Sciences! This lab focuses on introducing Kaggle, a resource for data sets, and how to analyze them using Google Sheets. We will be covering how to navigate through your system to download datasets and upload them into Google Sheets.

To do this lab, you will need a computer with local storage like a hard drive. Before we get started, it is important that you can access to both Kaggle and Google Sheets. You can use the links below to travel to those websites.


Part 1: Accessing Datasets through Kaggle

A dataset is a collection of information or data that data scientists can use to uncover patterns and make predictions. A dataset is very important as we share and store much information in datasets.

To begin this lab, use the link above to access the Kaggle website. Kaggle is a data science competition platform and online community for data scientists and machine learning practitioners, and it allows users to find and publish datasets. Once you are on the website, you’ll need to create a free Kaggle account. To create an account, look for the “Register” button located on the top right of the website and click it to begin registering. Once an account is registered, locate the datasets tab on the left side of the website.

Dataset tab

Once you have accessed the datasets tab, you can start searching for all the open source datasets Kaggle has. You can view all datasets Kaggle has by clicking the “All datasets” tab, or you can search for a specific dataset by using the search bar.

All Datasets tab
Search Datasets

You can use either option to start browsing for interesting datasets. Before you start searching for a dataset, there are some size restrictions for the dataset file that need to be considered. Computer files take space in the local storage of your computer, and the unit for digital information is called a byte. The common file sizes that are used are:

  1. Kilobyte (KB) - 1000 bytes

  2. Megabyte (MB) - 1000 KB

  3. Gigabyte (GB) - 1000 MB

  4. Terabyte (TB) - 1000 GB

Besides the size, there are other aspects of a dataset we must be mindful of:

  • Number of rows

  • Number of columns

    • Column Names

    • Type of data each column contains

This is important because Google Sheets has a 50 MB limit on files you can upload. With this knowledge, find a dataset that interests you and is under 50 MB (preferably closer to 10-15 MB). While searching for a dataset, keep these questions in mind:

  1. What is a question you would like to explore or answer using this dataset?

    • Think about something you are curious about or a pattern you would like to discover.

  2. What measurement(s) or variables from the dataset would help you answer your question?

    • Think about the specific columns/types of data you would need

  3. What is an ethical question or consideration related to analysing/using this dataset?

    • Where does the data come from?

    • How was the data collected/acquired?

    • Who benefits from the data collection?

    • Who might be harmed by the data collection or its use?

    • Is there consent for data collection?

    • What was the original purpose for collecting this data?

Once you have chosen a dataset that fits the criteria above, we can then start analyzing the dataset!


Part 2: Learning Kaggle’s User Interface

Most Kaggle datasets include a description explaining their purpose and contents. Once you have a dataset selected, you can start learning more about the dataset.


Understanding Kaggle Datasets

When you have selected a dataset, there is an “About Dataset” section that provides an overview and description of the dataset; scope, data collection methodology, and data quality standards. Here is an example:

Kaggle "About Dataset" example

If you scroll down a bit, you will see another tab named “Detail” tab that provides file type and data information.

Detail tab

The “Compact” tab provides an overview of the dataset as a relational database.

  • The left blue arrow describes the size of the dataset, and the right blue arrow shows the download icon for the dataset.

  • The green arrow points to a dropdown menu that displays the number of columns in the dataset.

  • The red arrow points to a drop down menu that allows you to sort the data by ascending/descending order. You can also choose a minimum and maximum value that you want to see in the dataset as well.

Compact tab
Compact Sorting Tab

The “Column” tab provides an overview of each column in the dataset. This tab also notes if the column has valid data (blue arrow) or invalid, mismatched, or missing data (red arrow).

Column tab

Lastly, there is a summary of the datset’s data type information on the right side of the webpage.

Summary of Dataset

You can learn most aspects of a dataset on Kaggle by looking under these sections for your particular dataset.


Review the Dataset Using the Compact and Column Tabs

Answer the following questions in the textbox following each question:

1. What is the name of your dataset? Provide a brief description of the dataset as well.

Type answer here

2. How many rows and columns does the dataset have?

Type answer here

3. What do the column names represent? Refer to Kaggle's description for context.

Type answer here

4. Are there patterns or trends you notice at a glance?

Type answer here

5. Can you identify any missing data or unusual values in the dataset?

Type answer here

6. What are the data values that are quantitative?

Type answer here

7. What are the data values that are qualitative (i.e. categorical)?

Type answer here


Analyze the Dataset

Answer the following questions based on what you observe:

  1. Identify two numerical columns and answer the following:

    • What do these numbers represent?

    • What range of values do they cover (for example, what are the minimum and maximum values)?

    • Identify one categorical column and list its unique categories or values.

Type answer here

  1. Consider the following questions:

    • What story does the dataset seem to tell?

    • Who might find this data useful, and for what purpose?

Type answer here


Part 3: Downloading and Importing a Dataset into Google Sheets

Once you have found an interesting dataset that satisfies the size criteria, you can download the dataset as a .csv file. To do this, click on the download button pointed out by the blue arrow on the “Compact” tab described in Part 2 of this lab.

Download Button Location

Once you have downloaded the dataset onto your computer, open up Google Sheets from the link at the beginning of the lab. Once you have signed into Google Sheets using your google email, create a blank spreadsheet by clicking on the “Blank spreadsheet” icon located on the top left of the Google Sheets page.

Blank Spreadsheet Location

Once you have created a new spreadsheet, you are going to want to import the dataset you downloaded from Kaggle. To do this, first look for the “File” tab in your untitled spreadsheet.

File Tab Location

Then, you are going to find the “Open” option under “File”. This will open your computer’s “Downloads” folder for any downloads you have made on your computer. You will want to select the dataset file that you just downloaded to open for the spreadsheet you have just created.

Downloads Folder Example

Once you have selected your dataset, your Google Spreadsheet should now be filled with information. Here is an example:

Imported Dataset Example

You have now uploaded a dataset into a Google Spreadsheet! Once you have imported your dataset into Google Spreadsheets, our next step is to create a sample of the big dataset to analyze.


Part 4: Sampling of the Dataset

A sample of a dataset is a partial count or measure of a dataset that is more often used in statistical studies. Normally, population datasets are too big to do analysis on, so we instead take a big enough sample of the dataset to represent the population we are studying.


Creating Random Sample

To sample from the dataset you have just imported into Google Sheets, you first need to randomize the rows in the dataset to ensure each data point has an equal chance of being chosen for the sample. To do this, you will need to highlight every row and 10 columns of the dataset in your spreadsheet.

Highlighting All Rows and 10 Columns Example

Next, You are going to want to shuffle the rows in your dataset. To do this, click on the “Data” tab and then select the “Randomize range” option.

Shuffling the dataset

Once the dataset is shuffled, you will want to create a new spreadsheet that will contain the sample of the dataset that you just shuffled. To do this, go to the “File” tab and select the “New” option. Then, create a new spreadsheet and name it “Sample - Name of your Dataset”.

Creating Spreadsheet for Sample

Once the new spreadsheet is created, copy over your sample of the dataset you are using. To do this, highlight 101 rows of the newly randomized dataset, including the header row and copy them. Head to your sample spreadsheet and paste the sample into it. Be sure to confirm that your header row and 100 randomized rows and 10 columns are in your sample spreadsheet.


Clean your Dataset

Datasets from Kaggle can potentially contain rows that have missing values. For this lab, we want to ensure the sample of our dataset has no rows or columns with missing values.

Once you have created your sample spreadsheet, please do the following:

  1. Verify you do not have rows or columns with missing values.

    • Replace any row with missing values with a new row from the randomized database spreadsheet.

  2. Verify data type format by column

    • Quantitative columns should only have numeric data types

    • Qualitative / categorical columns should have values algined with the columns values

    • Replace any row with formatting inconsistencies with a new row from the randomized database spreadsheet


Summarize your Findings

Once your dataset is clean and rid of any missing values or inconsistencies, answer the following questions:

1. What dataset did you select and why?
  • What are 2 questions you would like to explore or answer using this dataset?

Type answer here

2. What measurement(s) or variables from the dataset would help you answer your question?
  • Note any interesting patterns or insights you observed.

Type answer here

3. What is an ethical question or consideration related to analysing/using this dataset?

Type answer here

4. Any questions or challenges you encountered while exploring your dataset?

Type answer here


Once you have answered all questions, you are going to want to create a link to your sample spreadsheet. To do this, click the “Share” button on the top right hand corner of your sample spreadsheet

Share button

Once you click the share button, there will be a pop up window that has a section called “General access”. You will want to make sure that anyone with the link can be a viewer of your sample spreadsheet, so make sure the button on the right of the “Anyone with the link” section is set to “Viewer”.

Set permission of shareable link to "Viewer"

Once that is done, copy the link and paste it in the textbox below. Make sure that the sheet is viewable by clicking on the link after you have copied it.

Paste link here

And that is the end of Lab 1! The next lab will focus on beginning Python practices and how to use python to analyze your sample spreadsheet!