What is Data Loading and loading data into Google BigQuery ?

Data loading is the process of copying and loading data sets or data from a source file or an application to a cloud data warehouse. You must first load your data in to any cloud data warehouse so that data teams can analyse data and provide insights.

Google Cloud BigQuery

In this tutorial you would learn how to load data in to Google cloud data warehouse like BigQuery.

BigQuery is a serverless, highly-scalable and cost-effective data warehouse with built-in business intelligence engine and artificial intelligence platform.

Google BigQuery say’s that you just load your data into BigQuery and let google cloud platform handle all the hardwork. BigQuery an be accessed either by using Console or WebUI or command-line tool. Let us see an example how to load/move data in to BigQuery.

Data loading into BigQuery

You can use bq, the python based command line tool for BigQuery to load sample data into BigQuery and perform some queries on the data sets/tables.

Pre-requisites – Ensure you have your own GCP account or project for this exercise.

BigQuery offers a number of sample tables that you can run queries against or you can create/upload your own dataset. The below example shows you step by step process to load data in to BigQuery.

Step 1: Create new dataset

Create a new dataset named babynames in your GCP project

[email protected]:~ (sneppets-gcp)$ bq mk babynames
Dataset 'sneppets-gcp:babynames' successfully created.

[email protected]:~ (sneppets-gcp)$ bq ls
  datasetId
 -----------
  babynames

Step 2: Download the data source

In BigQuery before you load the data into table, first you need to add dataset to your project. We are going to use this babynames zip file as data source which contains approximately 7 MB of data about popular baby names, provided by the USSSA.

First download the source file in local directory using wget as shown below.

[email protected]:~ (sneppets-gcp)$ wget http://www.ssa.gov/OACT/babynames/names.zip
URL transformed to HTTPS due to an HSTS policy
--2019-11-17 19:04:22--  https://www.ssa.gov/OACT/babynames/names.zip
Resolving www.ssa.gov (www.ssa.gov)... 137.200.4.16, 2001:1930:e03::aaaa
Connecting to www.ssa.gov (www.ssa.gov)|137.200.4.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7200451 (6.9M) [application/zip]
Saving to: ‘names.zip’
names.zip                           100%[========================================================================>]   6.87M   282KB/s    in 27s 
2019-11-17 19:04:50 (263 KB/s) - ‘names.zip’ saved [7200451/7200451]

[email protected]:~ (sneppets-gcp)$ ls
names.zip

Now unzip the file. You could see big list of files! List them and check once using ls command

[email protected]:~ (sneppets-gcp)$ unzip names.zip

[email protected]:~ (sneppets-gcp)$ ls

Step 3: Upload/Load the dataset

This is the most important step, you have to use bq load command which is used to create or update table and load all the data in single step.

The bq load arguments you will be running are:

datasetID: babynames
tableID: names2010
source: yob2010.txt
schema: name:string,gender:string,count:integer

Now create your table using the following command

[email protected]:~ (sneppets-gcp)$ bq load babynames.names2010 yob2010.txt name:string,gender:string,count:integer

Upload complete.
Waiting on bqjob_r45d9b3ed11bed690_0000016e799a04a1_1 ... (0s) Current status: DONE

You can run the following command to check whether table names2010 appears now in your dataset babynames

[email protected]:~ (sneppets-gcp)$ bq ls babynames
   tableId    Type    Labels   Time Partitioning   Clustered Fields
 ----------- ------- -------- ------------------- ------------------
  names2010   TABLE

Step 4: Run Queries

Now you can query the data in a way you want and check the results. For instance, let us run the following command to return top 5 most popular boys names

bq query "SELECT name,count FROM babynames.names2010 WHERE gender = 'M' 
ORDER BY count DESC LIMIT 5"
[email protected]:~ (sneppets-gcp)$ bq query "SELECT name,count FROM babynames.names2010 WHERE gender = 'M' ORDER BY count DESC LIMIT 5"
Waiting on bqjob_r3a7870c4e97967c1_0000016e799f72b8_1 ... (0s) Current status: DONE   
+---------+-------+
|  name   | count |
+---------+-------+
| Jacob   | 22127 |
| Ethan   | 18002 |
| Michael | 17350 |
| Jayden  | 17179 |
| William | 17051 |
+---------+-------+

Clean up

You can run bq rm -r command to remove babynames dataset and all tables in that dataset from your GCP project.

bq rm -r babynames

Further Learning

References

Leave a Reply

avatar
  Subscribe  
Notify of