BigQuery (BQ) is a data warehouse used primarily by BI teams and analysts. Developers use it less extensively, since it is not a transactional database built around CRUD operations. A data warehouse typically connects to a BI tool, a data-analytics interface, or a pipeline. It is particularly useful for data operations that span channels: CRM metrics, external data points, and pretty much anything you might want to use in the future can be thrown into the warehouse.
BigQuery is the data warehouse service offered by GCP, with extensive abilities to schedule pipelines and automate data analysis. The toolkit helps you become a better data-driven marketer, conduct analysis you never thought you could do before, learn more about GCP and its possibilities, explore machine learning in the time to come, and consolidate all of your data in one place.
With the advent of microservices and Google products that blend so well with each other, it is fairly easy to consume these services and build an optimal model for better strategic, data-driven decisions. An example I would like to describe: using social media channels and an aggregation tool called Social Insider to capture as many data points as possible, using R and packages such as httr, plumbr, and rlist to fetch the responses and massage the data into a clean dataset with some native functions.
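The fetch-and-massage step can be sketched as follows. This is a minimal illustration in Python rather than the R packages mentioned above, and the response shape and field names are hypothetical assumptions about what a social-analytics API might return:

```python
import json

# Hypothetical sample of what a social-analytics API response might look
# like; the "metrics" nesting and field names are illustrative assumptions.
raw_response = json.loads("""
{
  "data": [
    {"channel": "instagram", "date": "2021-03-01",
     "metrics": {"followers": 1200, "engagement": 0.034}},
    {"channel": "facebook", "date": "2021-03-01",
     "metrics": {"followers": 8000, "engagement": null}}
  ]
}
""")

def flatten(records):
    """Flatten the nested metric dicts into tidy rows ready for loading."""
    rows = []
    for rec in records:
        row = {k: v for k, v in rec.items() if k != "metrics"}
        for metric, value in rec.get("metrics", {}).items():
            row[metric] = 0 if value is None else value  # normalise nulls
        rows.append(row)
    return rows

clean = flatten(raw_response["data"])
```

The point of the transform is that each row ends up flat, with nulls normalised, so it maps directly onto a warehouse table.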
After this process was done, I was able to push this dataset into a BigQuery project using the project → dataset → table hierarchy, which is the naming convention for BQ projects.
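That convention can be captured in a small helper that builds the fully qualified `project.dataset.table` identifier BigQuery expects; the project, dataset, and table names below are hypothetical:

```python
import re

def qualified_table_id(project: str, dataset: str, table: str) -> str:
    """Build a fully qualified BigQuery table identifier.

    Dataset and table names may only contain letters, numbers, and
    underscores, so validate them before composing the id.
    """
    for part in (dataset, table):
        if not re.fullmatch(r"[A-Za-z0-9_]+", part):
            raise ValueError(f"invalid BigQuery identifier: {part!r}")
    return f"{project}.{dataset}.{table}"

# Hypothetical example names:
table_id = qualified_table_id("my-marketing-project", "social_media", "instagram_daily")
```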
Now that datasets from so many channels are flowing into your data warehouse, you need to create the right schema, with primary and foreign keys across tables. This is a long-term decision that has to be made before any extract-and-load work begins. Conventionally the ideal process was extract, load, and then transform, but with the rapid evolution of the toolkit it has shifted to extract, transform, and then load. The idea behind this is that data should land in the warehouse as clean and normalised as possible, without messing with the underlying tables and schema design.
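A minimal sketch of that transform-before-load step: standardise column names, parse dates into a canonical format, and drop duplicate extracts before anything touches the warehouse. The row shape and field names are illustrative assumptions:

```python
from datetime import datetime

def transform(rows):
    """Transform step of an extract-transform-load flow.

    Lowercases column names, converts dd/mm/yyyy dates to ISO format,
    and de-duplicates on (channel, date) so repeated extracts of the
    same day do not double-count.
    """
    seen, out = set(), []
    for row in rows:
        clean = {k.strip().lower(): v for k, v in row.items()}
        clean["date"] = datetime.strptime(clean["date"], "%d/%m/%Y").date().isoformat()
        key = (clean["channel"], clean["date"])
        if key in seen:
            continue  # duplicate extract of the same channel/day
        seen.add(key)
        out.append(clean)
    return out

# Hypothetical raw rows, including a duplicate:
rows = [
    {"Channel": "instagram", "Date": "01/03/2021", "Followers": 1200},
    {"Channel": "instagram", "Date": "01/03/2021", "Followers": 1200},
]
tidy = transform(rows)
```

Because the cleaning happens before the load, the warehouse tables only ever see normalised rows and the schema stays stable.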
The cloud gives you flexibility, scalability, and a lot of customisability, along with resilience. We live in a time when cross-cloud connections are possible, i.e. an AWS service can connect to a GCP service; that is the best part about the cloud and its serviceability. GCP offers many services across storage, compute, big data, and ML, and any of them can be used to load data into the warehouse. For example, Cloud Dataflow is used for streaming pipelines, with specific use cases such as IoT or other streaming data services. The other approach is a batch pipeline: because APIs take time to update, a batch process is said to be the ideal way to interact with warehouses for marketing needs (as stated in the video). I don't fully agree with that, but I'll leave it for a personal discussion.
The typical flow described goes from Cloud Storage to BigQuery with some simple GUI-based configuration, after designing the table and uploading files from the local machine. That is the simple use case; more typically, you would access BQ through a service account with the right permissions and push the desired data frames and tables to BigQuery programmatically.
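The service-account flow can be sketched like this. It is a hedged sketch, not a definitive implementation: the key path and table id are hypothetical, google-cloud-bigquery must be installed, and the service account needs write permission on the dataset for the load to succeed:

```python
def load_dataframe(df, table_id, key_path="service-account.json"):
    """Push a pandas DataFrame to a BigQuery table via a service account.

    `table_id` is a fully qualified `project.dataset.table` string and
    `key_path` points to the downloaded JSON key; both are assumptions
    for illustration.
    """
    from google.cloud import bigquery  # deferred import; see helper below

    client = bigquery.Client.from_service_account_json(key_path)
    job = client.load_table_from_dataframe(df, table_id)
    return job.result()  # blocks until the load job finishes

def has_required_key_fields(key: dict) -> bool:
    """Sanity-check that a service-account key file contains the fields
    authentication needs before attempting a load."""
    return {"type", "project_id", "private_key", "client_email"} <= key.keys()
```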
Now, data in BigQuery looks almost identical to a row-column format, but it uses partitioned tables, which speed up the querying process; that is the underlying USP of BigQuery: ramping up query performance.
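One way to see partitioning concretely is the DDL for a date-partitioned table: queries that filter on the partition column then scan only the relevant partitions instead of the whole table. The generator below is a sketch, and the table and column names are illustrative assumptions:

```python
def partitioned_table_ddl(table_id: str, columns: dict, partition_col: str) -> str:
    """Generate BigQuery DDL for a table partitioned on a DATE column."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE TABLE `{table_id}` (\n  {cols}\n)\n"
        f"PARTITION BY {partition_col}"
    )

# Hypothetical marketing table, partitioned by day:
ddl = partitioned_table_ddl(
    "my-marketing-project.social_media.instagram_daily",
    {"channel": "STRING", "event_date": "DATE", "followers": "INT64"},
    "event_date",
)
```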
It gives you an interface to connect programmatically, push data, and retrieve whatever records you like. I would recommend the CXL BigQuery course to get an overview of BigQuery and its use cases.
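The retrieval side of that programmatic interface can be sketched with a parameterised query. The table and column names are hypothetical, and the `run_query` part requires google-cloud-bigquery and valid credentials, so treat it as a sketch under those assumptions:

```python
def build_channel_query(table_id: str) -> str:
    """Compose a parameterised query for one channel's daily followers.
    Column names are illustrative assumptions."""
    return (
        f"SELECT event_date, followers FROM `{table_id}` "
        "WHERE channel = @channel ORDER BY event_date"
    )

def run_query(table_id: str, channel: str, key_path="service-account.json"):
    """Run the query via the BigQuery client (requires credentials)."""
    from google.cloud import bigquery

    client = bigquery.Client.from_service_account_json(key_path)
    config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("channel", "STRING", channel)]
    )
    return list(client.query(build_channel_query(table_id), job_config=config))
```

Using a query parameter rather than string interpolation for the channel value keeps the query safe and lets BigQuery cache the plan.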