Building an ELT pipeline using Google Sheets as an intermediary
BigQuery lets you quickly import a CSV file, both from the web user interface and from the command line.
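As a rough sketch of the command-line route, the snippet below assembles a `bq load` invocation with schema autodetection. The dataset, table, and file names are placeholders chosen for illustration, not names from this article, and the command is only printed rather than executed.

```python
import shlex
from typing import List


def build_bq_load_command(table: str, csv_path: str) -> List[str]:
    """Return the argv for a `bq load` call that autodetects the schema."""
    return [
        "bq", "load",
        "--autodetect",          # let BigQuery infer column names and types
        "--source_format=CSV",
        table,                   # destination, written as dataset.table
        csv_path,                # local file path or gs:// URI
    ]


# Hypothetical names, used only for illustration.
cmd = build_bq_load_command("mydataset.airbnb_autodetect", "airbnb_listings.csv")
print(shlex.join(cmd))

# To actually run it (needs an installed and authenticated Cloud SDK):
# import subprocess; subprocess.run(cmd, check=True)
```

Building the argument list separately from running it keeps the sketch easy to inspect before anything touches a real project.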
Limitations of autodetect and import
This works for your plain-vanilla CSV files, but can fail on complex CSV files. As an example of a file it fails on, let’s take a dataset of New York City Airbnb rentals from Kaggle. This dataset has 16 columns, but one of the columns consists of pretty much free-form text. This means that it can contain emojis, newline characters, …
Indeed, try to load this file into BigQuery and the load fails with errors.
This is because a row is spread across multiple lines, so the opening quote on one line isn’t closed on that same line. This isn’t an easy problem to solve: lots of tools struggle with CSV files that have new lines inside cells.
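To make the failure mode concrete, here is a small sketch using only Python’s standard library. The sample record is invented for illustration; it is not taken from the Airbnb dataset. Splitting on physical lines breaks the record in two, while a quote-aware parser keeps it together.

```python
import csv
import io

# A header plus one record whose "name" cell contains a line break.
raw = 'id,name,price\n1,"Cozy loft\nnear the park",120\n'

# Naive approach: treat every physical line as one record.
naive_rows = [line.split(",") for line in raw.splitlines()]

# Quote-aware approach: the csv module keeps the quoted cell in one piece.
parsed_rows = list(csv.reader(io.StringIO(raw, newline="")))

print(len(naive_rows))    # number of physical lines
print(naive_rows[1])      # the record is cut off after the opening quote
print(len(parsed_rows))   # number of logical records, header included
print(parsed_rows[1])
```

The second naive row ends in a cell that starts with a quote and never closes it, which is the situation described above.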
Sheets to the rescue
Google Sheets, on the other hand, has a much better CSV import mechanism. Open up a Google Sheet, import the CSV file and voila …
The cool thing is that by using a Google Sheet, you can do interactive data preparation in the Sheet before loading it into BigQuery.
First, delete the first row (the header) from the sheet. We don’t want that in our data.
ELT from a Google Sheet
Once it’s in Google Sheets, we can use a handy little trick: BigQuery can directly query Google Sheets! To do that, we define the Google Sheet as a table in BigQuery:
Steps from the BigQuery UI
- Select a dataset and click on Create Table
- Select Drive as the source, specify the Drive URL to the Google Sheet
- Set Google Sheet as the file format
- Give the table a name. I named it airbnb_raw_googlesheet
- Specify the schema
This table doesn’t copy the data from the sheet; it queries the sheet live.
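The same table can probably be defined without the UI. The sketch below builds an external table definition in the JSON shape that BigQuery uses for external data sources and prints it. The sheet URL, the dataset name, and the three columns are placeholders; they are not the schema used in this article, which has 16 columns.

```python
import json
from typing import Dict


def sheet_table_definition(sheet_url: str, columns: Dict[str, str]) -> dict:
    """Build an external table definition that points BigQuery at a Google Sheet."""
    return {
        "sourceFormat": "GOOGLE_SHEETS",
        "sourceUris": [sheet_url],
        "schema": {
            "fields": [{"name": name, "type": type_} for name, type_ in columns.items()]
        },
    }


# Hypothetical URL and columns, for illustration only.
definition = sheet_table_definition(
    "https://docs.google.com/spreadsheets/d/SHEET_ID",
    {"id": "INTEGER", "name": "STRING", "price": "FLOAT"},
)
print(json.dumps(definition, indent=2))

# Save the printed JSON to a file (say, sheet_def.json) and then, assuming the
# bq tool is authenticated with access to Google Drive:
#   bq mk --external_table_definition=sheet_def.json mydataset.airbnb_raw_googlesheet
```

Because the table is only a pointer to the sheet, edits made in the sheet show up in the next query without re-running this step.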
So, let’s copy the data as-is into a native BigQuery table (of course, we could do some transformation here as well).
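One plausible way to do the copy is a `CREATE OR REPLACE TABLE ... AS SELECT` statement, sketched below. The table name `airbnb_raw_googlesheet` comes from the steps above; the dataset name `mydataset` and the destination table `airbnb_nyc` are assumptions made for this example.

```python
def build_copy_query(source_table: str, destination_table: str) -> str:
    """Return a standard SQL statement that materializes the sheet-backed table."""
    return (
        f"CREATE OR REPLACE TABLE `{destination_table}` AS\n"
        f"SELECT * FROM `{source_table}`"
    )


# The dataset and destination names are hypothetical.
sql = build_copy_query("mydataset.airbnb_raw_googlesheet", "mydataset.airbnb_nyc")
print(sql)

# The statement can be pasted into the BigQuery console, or run from the
# command line with the argument list below (not executed here).
cmd = ["bq", "query", "--use_legacy_sql=false", sql]
```

Any transformation would go in the `SELECT` clause, for example casting columns or trimming the free-form text, instead of `SELECT *`.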
How to automate
You can automate these steps:
- Here’s an article on how you can read a CSV file into Sheets using Python
- From then on, use dataform.co or BigQuery scripts to define the BigQuery table and do the ELT.
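As a starting point for the first step, the sketch below parses a CSV with the standard library, drops the header row as described earlier, and shapes the result as a `{"values": [...]}` body, which is the row format the Sheets API expects when writing cell values. The function name and the sample data are made up for this example, and the actual upload call is left as a comment because it needs credentials and a client library.

```python
import csv
import io
from typing import Dict, List


def csv_to_sheet_body(csv_text: str, skip_header: bool = True) -> Dict[str, List[List[str]]]:
    """Parse CSV text (quoted newlines included) into a Sheets-style values body."""
    rows = list(csv.reader(io.StringIO(csv_text, newline="")))
    if skip_header and rows:
        rows = rows[1:]  # the header is not wanted in the data
    return {"values": rows}


# Invented sample data, including a cell with an embedded line break.
sample = 'id,name\n1,"Cozy loft\nnear the park"\n2,Sunny studio\n'
body = csv_to_sheet_body(sample)
print(body)

# With an authenticated Sheets API client, this body would be passed to a
# values update call for the target spreadsheet and range.
```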
To import complex CSV files into BigQuery, build an ELT pipeline using Google Sheets as an intermediary. This allows you to handle CSV files with new lines and other special characters in the columns. Enjoy!