Part two of our conversation around machine learning and BigQuery continues with a deeper dive into AutoML Tables with BigQuery. We’ll go over good use cases for it, how it works, and how “auto” it truly is.

Meet the Speakers

Jared Burns

Data Science Engineer at Agosto

Mark Brose

Vice President of Engineering at Agosto

Transcript

– Today we’re going to talk about AutoML Tables with BigQuery and take a little deep dive into it. I’m curious to see what the good use cases are, how it works, how easy it really is to use, and how auto it is. So let’s dig in a little bit and maybe talk first about what AutoML Tables is and what some good use cases for it are. Where does it fit?

– Yeah, so AutoML Tables is a tool that almost fully automates the steps of your machine learning workflow. Things like feature engineering and hyperparameter tuning are all handled for you behind the scenes by Google AutoML Tables, and all you really have to do is point the UI at a particular BigQuery table.

– Yeah, and if you’re able to show us, let’s dig in and take a look at how it works.

– Yeah, let’s jump in and show AutoML Tables.

– What do you got here? Looks like some data, how do we, you know, how do we use this?

– Yeah, so from the GCP hamburger menu, as they call it, you get to the Tables UI, where you have the options of Data Sets and Models. Data Sets is where you import your data and give it a reference. In this case, we’re using a public data set for fraud detection. AutoML Tables reads your data efficiently using the underlying BigQuery Storage API, which reads data a lot faster than if you were using, for example, the Python API. This particular data set has a lot of features that are abstracted away from what the actual variables are, because it’s a very sensitive data set with some PII in it, and the curators didn’t want people using it to see PII data. That’s not important for us; just know that there are actual features behind the scenes, but they’re all represented as variable 1 through 20-something. Once we import our data, the UI gives you a nice summary: we have 31 columns and 284,000 rows. It also gives you a breakdown of the numeric versus the categorical features. In our case, the one categorical feature is our target column, which we specify up here. It’s our Class column, which says whether the transaction is fraud or not.
It’s a highly imbalanced class. In machine learning terms, that means there are very few fraudulent transactions compared to all of the valid transactions. Right out of the box, without your having to do anything, AutoML Tables gives you this really nice breakdown showing whether each feature is nullable or not, the data type of the column, the percentage of values that are missing or invalid, and the number of distinct values. That’s helpful because if you have features that don’t vary very much, you might want to consider not including them in your model. You also have the correlation-with-target breakdown here, which shows how correlated each input feature is with the target you specify. So right out of the gate, this takes a lot of the legwork out of your typical ML workflow by breaking down all of these features for you. This is all before I’ve even trained the model. If I click on Train Model, it asks you for a budget, which is the number of node hours you want AutoML Tables to spend trying to optimize and find the best model. Once you provide that and select the features for your model, you click Train Model. It’ll run for several hours and send you an email notification once your model is ready. In this case, I’ve already trained this model, so we don’t have to go through that. This is the basic summary you get once your model is done training. It provides things like the area under the precision-recall curve, your accuracy, and your log loss, and we can click on it to see a further breakdown of how the model performed.
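
Editor’s note: the per-column breakdown described above (percent missing, distinct values) can be approximated by hand. The following is only a minimal sketch in plain Python, not how AutoML Tables computes its statistics. The function name `summarize_columns`, the column names `V1` and `V2`, and the sample rows are all hypothetical.

```python
def summarize_columns(rows):
    """Return percent missing and distinct-value count for each column.

    `rows` is a list of dicts; a value of None counts as missing.
    """
    columns = sorted({key for row in rows for key in row})
    summary = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        summary[col] = {
            "pct_missing": 100.0 * (len(values) - len(present)) / len(values),
            "distinct": len(set(present)),
        }
    return summary


# Hypothetical sample rows; the real data set has 31 columns.
rows = [
    {"V1": 0.12, "V2": 1.0, "Class": 0},
    {"V1": -1.30, "V2": 1.0, "Class": 0},
    {"V1": None, "V2": 1.0, "Class": 1},
    {"V1": 0.45, "V2": 1.0, "Class": 0},
]

summary = summarize_columns(rows)
for name, stats in summary.items():
    print(name, stats)
```

In this toy sample, `V2` has a single distinct value, which is the kind of feature the speaker suggests you might leave out of the model.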

– So what do you think about this? Does it look pretty good?

– It does look pretty good, yes, although this is a highly imbalanced data set, so you have to judge the results in terms of how balanced the data set is. The results look good, but they’re also somewhat deceiving, because in the raw data there are only a very few fraudulent transactions. We could get a very accurate model simply by guessing that every single transaction is non-fraudulent. So you always have to judge it against that baseline of how balanced your data set is.
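
Editor’s note: the baseline the speaker describes can be made concrete with a little arithmetic. This is a minimal sketch; the row count is the rounded figure quoted in the demo, while the fraud count of 500 is a hypothetical number chosen only for illustration, and `majority_baseline_accuracy` is a made-up helper name.

```python
def majority_baseline_accuracy(n_total, n_positive):
    """Accuracy of a model that always predicts the more common class."""
    n_negative = n_total - n_positive
    return max(n_negative, n_positive) / n_total


n_rows = 284_000   # rounded row count mentioned in the demo
n_fraud = 500      # hypothetical count of fraudulent rows (assumption)

baseline = majority_baseline_accuracy(n_rows, n_fraud)
print(f"Always guessing non-fraud is {baseline:.2%} accurate")
```

Under these assumed counts, a model that never flags fraud is still more than 99% accurate while catching zero fraudulent transactions, which is why accuracy alone is a misleading yardstick here and metrics like the area under the precision-recall curve matter more.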

– Sure. That does a lot for you, but you still need to understand a little bit about what you’re tryin’ to do, and what the outcomes are gonna be, right?

– Yeah, exactly! One cool feature here is that it gives you this slider for the score threshold. Your score threshold is the cutoff applied to your prediction score, and it’s what the model uses to assign the class. So, let’s say we have it set at a given value right now: any prediction score above that value is going to be a prediction of fraudulent, and anything below it is going to be non-fraudulent. Well, maybe we want to customize this and set the threshold to anything we want. We can do that in the UI using this tool, and it’ll update your results in real time as you move the slider back and forth.
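
Editor’s note: what the slider does can be sketched in a few lines. The example below is an illustration, not the AutoML Tables implementation. The scores and labels are invented, the function name `precision_recall` is hypothetical, and treating a score equal to the threshold as fraudulent is an assumption.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall for the positive (fraud) class at a threshold.

    A score at or above `threshold` is treated as a fraud prediction.
    """
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented prediction scores and true labels (1 = fraud).
scores = [0.05, 0.10, 0.40, 0.55, 0.60, 0.90]
labels = [0, 0, 1, 0, 1, 1]

for threshold in (0.3, 0.5, 0.8):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Moving the threshold up trades recall for precision: fewer transactions are flagged, but more of the flagged ones are truly fraudulent.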

– Yeah, so what’s next? Say we want to use this in an application, or somebody just wants to run it against queries and data. How do you then take this to production?

– Yeah, so the next thing I would point to is your chart of feature importance. Maybe you want to go back and retrain your model after removing a lot of the features that aren’t as important. Variable 15 came in as the most important by far, so maybe you want to retrain the model without some of those unimportant features. Otherwise, if you’re good to go with your model and you just want to use it, there are several ways you can do that. You can generate a batch prediction. That’s what you want if you just want to point your model at a new data set in BigQuery and get predictions on that data. Let’s say you have a whole new day’s worth of transactions that you want predictions for; here’s where you would come in and do that. You can also get online predictions, which will be a little bit more expensive but will let you get predictions in real time. You can see that in this particular use case, when you’re at the point of sale trying to predict whether a transaction is fraudulent or not, it would be really valuable to have this available in real time. And then you could also export your model. Let’s say you want to serve your model locally on a device; this is where you can go and do that.

– Yeah, however you do your model hosting today, right? So you can export it, it’s TensorFlow, and run it anywhere you want to, right? You can run it in your own containers or VMs, kind of whatever works for you, right?

– Yep.

– That’s awesome. Thanks for the overview.

– Yeah.