Part three of four of our Machine Learning – BigQuery discussion focuses on various scenarios for BigQuery ML and where it is a good fit for ML projects. In addition, we’ll go over which models are a better fit, and whether a range of model choices is the best approach.

Meet the Speakers

Jared Burns

Data Science Engineer at Agosto

Mark Brose

Vice President of Engineering at Agosto

Transcript

– BigQuery ML really fits in there as that good baseline strategy because you don’t need to move data anywhere. You can simply execute SQL right in the BigQuery UI and get a good understanding of what kind of results and accuracy you can expect.

– So today we’re going to talk about BigQuery ML. Jared, tell us a little bit about BigQuery ML, maybe a little bit about what’s a good scenario for BigQuery ML? Where is it a good fit for your ML projects?

– Yeah, I think it fits in right at the point where, let’s say, you want to develop a model very quickly and just get a baseline for how good a model you can get. Sometimes you want really custom models, but other times you might just want an understanding of the expectation or baseline for how well a model can predict whatever it is you’re trying to do. And I think BigQuery ML really fits in there as that good baseline strategy, because you don’t need to move data anywhere. You can simply execute SQL right in the BigQuery UI and get a good understanding of what kind of results and accuracy you can expect.

– Are there types of models that you think are a better fit for BigQuery ML, or should you think of it as a range of model choices? How do you think about that?

– Yeah, so right out of the gate, there are several models available for you to use right away: linear regression, classification or logistic regression, and k-means clustering. There is also matrix factorization, which is just in GA now; that’s the framework to use if you want to generate product recommendations or movie recommendations. You can also import TensorFlow models to use in BigQuery ML as well. So really, it’s a full range of models.
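As a rough sketch of what choosing among those models looks like (dataset, table, and column names here are hypothetical, not from the demo), the model type is just one option in a `CREATE MODEL` statement:

```sql
-- Hypothetical dataset and column names; model_type selects the algorithm.
CREATE OR REPLACE MODEL `my_dataset.sales_model`
OPTIONS (
  model_type = 'linear_reg',        -- or 'logistic_reg', 'kmeans',
                                    -- 'matrix_factorization', 'tensorflow'
  input_label_cols = ['sales']      -- the column to predict (supervised models)
) AS
SELECT * FROM `my_dataset.training_data`;
```

Swapping the `model_type` value (and, for unsupervised models like k-means, dropping the label column) is essentially all it takes to move between model families.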

– Cool. Yeah, let’s dig in a little. Do you have some you can show us?

– Yeah, absolutely. So I’m in the BigQuery UI right now, and I have some sample data prepared: a retail data set based on some sales data and also some demographics about customers. Let’s say that I wanted to get an idea of how to categorize or segment my customer base. You might want to do this for a number of reasons: for email campaigns, to understand different cohorts of your customers, or to put customers on separate journeys based on their customer lifetime value. That’s really where k-means clustering comes in handy. So if I wanted to develop a k-means clustering model using BigQuery ML, I can utilize a lot of the same SQL syntax that I would use to generate a table. In this case, all you need to do is say “create.” If you already have a model that you’re writing over, you would specify that you want to replace it as well. So in this case, let’s create a model named “clustering_model,” and then it gives you an OPTIONS clause, where you can pass several options into your BigQuery ML setup. In this case, we have to specify the model type. In our case, it’s k-means, but you can also specify linear regression or classification. If you’re importing a TensorFlow model, you would say “TensorFlow” here, and there’s also matrix factorization. You have to know what kind of model you’re looking for, and that comes with the business understanding of what kind of prediction you want to get out of this. Do you want to predict classes of something, or do you want to predict raw numerical values? So you have to have a little bit of understanding about the background of what you’re trying to predict in order to use this to its full advantage.
And then it gives you several of what are called “hyperparameters.” Hyperparameters are configurable options in your model where you don’t know ahead of time what will produce the best model, so you need to try various methods and ranges in order to find the best one. This is an unsupervised modeling approach, which means we’re not trying to predict a numerical value or a class; we’re just trying to group our values into similar groups. So there isn’t really a raw prediction value we’re trying to predict, but we still get measures of quality in terms of how well the clusters fit together, so we still have some configuration that we can do. You can do that in BigQuery using something like a script, or if you’re a Python user, you can utilize a Python for-loop to iterate over several different options. But typically, you can specify your options and pass them in this way as well. So in this case, I want to specify that the number of clusters is four. This will assign every single value in this data set to one of the four clusters, and it’s going to do it based on the Euclidean distance.
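A statement along the lines of what’s described above might look like this (the table and feature columns are hypothetical stand-ins for the retail demo data):

```sql
-- Hypothetical retail table; segments customers into four clusters.
CREATE OR REPLACE MODEL `retail.clustering_model`
OPTIONS (
  model_type = 'kmeans',
  num_clusters = 4,            -- hyperparameter: worth trying several values
  distance_type = 'euclidean'  -- how similarity between customers is measured
) AS
SELECT
  total_spend,
  visit_count,
  avg_basket_size
FROM `retail.customer_features`;
```

Note there is no label column in the `SELECT`: since k-means is unsupervised, every selected column is treated as an input feature.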

– So if you’re doing this, you may, as you’re building this, you may experiment with different cluster sizes. You may iterate on this a few times, right, as you’re doing this?

– Yeah, if you’re trying to fine-tune this model, you would want to try a lot of different cluster sizes, and maybe even change the distance type. You would have to set that up as a script. You can do that in BigQuery, or you can rely on open-source tools like Python.
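One way to sketch that kind of script with BigQuery’s own scripting support (model and table names are hypothetical) is a loop that retrains the model for each candidate cluster count:

```sql
-- Sketch of a BigQuery script that tries several cluster counts.
-- Each iteration creates a separate model you can evaluate afterwards.
DECLARE k INT64 DEFAULT 3;

WHILE k <= 6 DO
  EXECUTE IMMEDIATE FORMAT("""
    CREATE OR REPLACE MODEL `retail.clustering_model_k%d`
    OPTIONS (model_type = 'kmeans', num_clusters = %d) AS
    SELECT total_spend, visit_count, avg_basket_size
    FROM `retail.customer_features`
  """, k, k);
  SET k = k + 1;
END WHILE;
```

You would then compare the resulting models’ cluster-quality metrics (for example via `ML.EVALUATE`) to pick the best cluster count.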

– Yeah. What’s better about doing it, you think, in BigQuery than doing it in a more traditional way, without BigQuery? What’s the advantage here?

– So one of my favorite features of BigQuery ML is that you can just specify standardize_features equal to true, and that scales your inputs to a normalized range. That’s something you typically have to do yourself in open-source Python, fitting a scaler, calling it, and then saving it so you can apply it again when you want to generate predictions in real time, too. But in this case, we can just specify that in the options and BigQuery will do it for us. Additionally, we can pass in several transformations. Let’s say that we wanted to generate our model from a transformation on our input data; we can pass transformations in here, too. And the good thing is that the model remembers those transformations, so when we call for predictions the next time, it’s going to apply those same transformations again in order to serve predictions. It keeps track of all of those transformations for you, which is something really powerful about the BigQuery ML approach. Let’s say we run this and we have our model. It gives you all of the useful information you care about in evaluating your model: a breakdown of how many iterations your model ran and the loss at each respective iteration, and then a breakdown of all of your clusters and how the input features vary across them. So you can get a really good sense of which features are important for each cluster.
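The standardization option and the transformations described above can be sketched like this (columns and the chosen transformations are hypothetical; the key point is that the `TRANSFORM` clause is stored with the model and replayed at prediction time):

```sql
-- Hypothetical columns; standardize_features normalizes inputs, and
-- everything in TRANSFORM is re-applied automatically when predicting.
CREATE OR REPLACE MODEL `retail.clustering_model`
TRANSFORM (
  ML.STANDARD_SCALER(total_spend) OVER () AS total_spend_scaled,
  ML.QUANTILE_BUCKETIZE(age, 4) OVER () AS age_bucket
)
OPTIONS (
  model_type = 'kmeans',
  num_clusters = 4,
  standardize_features = TRUE
) AS
SELECT total_spend, age
FROM `retail.customer_features`;
```

Because the transformations live inside the model, callers at prediction time pass raw columns and never have to reproduce the preprocessing themselves.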

– Great, so now as we built this out, when we get to the point where we’ve evaluated and we’re good with this, where do we take it from there? Can we serve it up here as well?

– Yeah, so once we say “create or replace model,” that model is saved for us, just like a table is saved in BigQuery. So once you save it as a model, it’s ready to serve predictions for you in real time. All we have to do is provide it the same input features and we can get those predictions back.
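Serving those predictions is a single query with `ML.PREDICT` (the input table here is a hypothetical stand-in; for a k-means model, the output includes the assigned cluster):

```sql
-- Score new rows with the saved model; centroid_id is the cluster
-- each customer is assigned to.
SELECT
  centroid_id,
  *
FROM ML.PREDICT(
  MODEL `retail.clustering_model`,
  (SELECT total_spend, visit_count, avg_basket_size
   FROM `retail.new_customers`)
);
```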

– We can do the whole process right here in the BigQuery console, right?

– Yep, yeah, there’s no need to move data from BigQuery to a Jupyter Notebook, for example.

– But then if I want to get the model out of here and say, run it in production somewhere else, do I have that option as well?

– You can, yeah. BigQuery just announced this week that you can download your classification, linear regression, and logistic regression models. You can export those as TensorFlow SavedModel objects and then use those models wherever you want, as well.
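One way that export works is via an `EXPORT MODEL` statement, which writes the model to Cloud Storage as a TensorFlow SavedModel (the model name and bucket path here are hypothetical, and export is supported only for certain model types):

```sql
-- Exports a trained model to Cloud Storage as a TensorFlow SavedModel.
EXPORT MODEL `retail.regression_model`
OPTIONS (uri = 'gs://my-bucket/models/regression_model/');
```

From there, the SavedModel can be served outside BigQuery, for example with TensorFlow Serving or on another prediction platform.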

– Really powerful. Thanks, Jared.

– Thanks.