Predicting Ecommerce with Google Data
Predicting Ecommerce with Google Data
11 minute read
Predicting Ecommerce with Google Data
Here is what we will cover.
- What is the business problem?
- What is the data?
- JSON Manipulation
What is the business problem?
The business problem is that we want to predict whether a user will make a purchase or not.
Data was discovered using Google Bard and parf of BigQuery. For this exercise I exported the data as a compressed JSON file. After downloading the file I unzipped it and uploaded it to Databricks. The schema looks like this:
root
|-- channelGrouping: string (nullable = true)
|-- customDimensions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- index: string (nullable = true)
| | |-- value: string (nullable = true)
|-- date: string (nullable = true)
|-- device: struct (nullable = true)
| |-- browser: string (nullable = true)
| |-- browserSize: string (nullable = true)
| |-- browserVersion: string (nullable = true)
| |-- deviceCategory: string (nullable = true)
| |-- flashVersion: string (nullable = true)
| |-- isMobile: boolean (nullable = true)
| |-- language: string (nullable = true)
| |-- mobileDeviceBranding: string (nullable = true)
| |-- mobileDeviceInfo: string (nullable = true)
| |-- mobileDeviceMarketingName: string (nullable = true)
| |-- mobileDeviceModel: string (nullable = true)
| |-- mobileInputSelector: string (nullable = true)
| |-- operatingSystem: string (nullable = true)
| |-- operatingSystemVersion: string (nullable = true)
| |-- screenColors: string (nullable = true)
| |-- screenResolution: string (nullable = true)
|-- fullVisitorId: string (nullable = true)
|-- geoNetwork: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- cityId: string (nullable = true)
| |-- continent: string (nullable = true)
| |-- country: string (nullable = true)
| |-- latitude: string (nullable = true)
| |-- longitude: string (nullable = true)
| |-- metro: string (nullable = true)
| |-- networkDomain: string (nullable = true)
| |-- networkLocation: string (nullable = true)
| |-- region: string (nullable = true)
| |-- subContinent: string (nullable = true)
|-- hits: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- appInfo: struct (nullable = true)
| | | |-- exitScreenName: string (nullable = true)
| | | |-- landingScreenName: string (nullable = true)
| | | |-- screenDepth: string (nullable = true)
| | | |-- screenName: string (nullable = true)
| | |-- contentGroup: struct (nullable = true)
| | | |-- contentGroup1: string (nullable = true)
| | | |-- contentGroup2: string (nullable = true)
| | | |-- contentGroup3: string (nullable = true)
| | | |-- contentGroup4: string (nullable = true)
| | | |-- contentGroup5: string (nullable = true)
| | | |-- contentGroupUniqueViews1: string (nullable = true)
| | | |-- contentGroupUniqueViews2: string (nullable = true)
| | | |-- contentGroupUniqueViews3: string (nullable = true)
| | | |-- previousContentGroup1: string (nullable = true)
| | | |-- previousContentGroup2: string (nullable = true)
| | | |-- previousContentGroup3: string (nullable = true)
| | | |-- previousContentGroup4: string (nullable = true)
| | | |-- previousContentGroup5: string (nullable = true)
| | |-- customDimensions: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- customMetrics: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- customVariables: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- dataSource: string (nullable = true)
| | |-- eCommerceAction: struct (nullable = true)
| | | |-- action_type: string (nullable = true)
| | | |-- option: string (nullable = true)
| | | |-- step: string (nullable = true)
| | |-- eventInfo: struct (nullable = true)
| | | |-- eventAction: string (nullable = true)
| | | |-- eventCategory: string (nullable = true)
| | | |-- eventLabel: string (nullable = true)
| | |-- exceptionInfo: struct (nullable = true)
| | | |-- isFatal: boolean (nullable = true)
| | |-- experiment: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- hitNumber: string (nullable = true)
| | |-- hour: string (nullable = true)
| | |-- isEntrance: boolean (nullable = true)
| | |-- isExit: boolean (nullable = true)
| | |-- isInteraction: boolean (nullable = true)
| | |-- item: struct (nullable = true)
| | | |-- currencyCode: string (nullable = true)
| | | |-- transactionId: string (nullable = true)
| | |-- latencyTracking: struct (nullable = true)
| | | |-- domContentLoadedTime: string (nullable = true)
| | | |-- domInteractiveTime: string (nullable = true)
| | | |-- domLatencyMetricsSample: string (nullable = true)
| | | |-- domainLookupTime: string (nullable = true)
| | | |-- pageDownloadTime: string (nullable = true)
| | | |-- pageLoadSample: string (nullable = true)
| | | |-- pageLoadTime: string (nullable = true)
| | | |-- redirectionTime: string (nullable = true)
| | | |-- serverConnectionTime: string (nullable = true)
| | | |-- serverResponseTime: string (nullable = true)
| | | |-- speedMetricsSample: string (nullable = true)
| | |-- minute: string (nullable = true)
| | |-- page: struct (nullable = true)
| | | |-- hostname: string (nullable = true)
| | | |-- pagePath: string (nullable = true)
| | | |-- pagePathLevel1: string (nullable = true)
| | | |-- pagePathLevel2: string (nullable = true)
| | | |-- pagePathLevel3: string (nullable = true)
| | | |-- pagePathLevel4: string (nullable = true)
| | | |-- pageTitle: string (nullable = true)
| | | |-- searchCategory: string (nullable = true)
| | | |-- searchKeyword: string (nullable = true)
| | |-- product: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- customDimensions: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
| | | | |-- customMetrics: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
| | | | |-- isClick: boolean (nullable = true)
| | | | |-- isImpression: boolean (nullable = true)
| | | | |-- localProductPrice: string (nullable = true)
| | | | |-- localProductRevenue: string (nullable = true)
| | | | |-- productBrand: string (nullable = true)
| | | | |-- productListName: string (nullable = true)
| | | | |-- productListPosition: string (nullable = true)
| | | | |-- productPrice: string (nullable = true)
| | | | |-- productQuantity: string (nullable = true)
| | | | |-- productRevenue: string (nullable = true)
| | | | |-- productSKU: string (nullable = true)
| | | | |-- productVariant: string (nullable = true)
| | | | |-- v2ProductCategory: string (nullable = true)
| | | | |-- v2ProductName: string (nullable = true)
| | |-- promotion: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- promoCreative: string (nullable = true)
| | | | |-- promoId: string (nullable = true)
| | | | |-- promoName: string (nullable = true)
| | | | |-- promoPosition: string (nullable = true)
| | |-- promotionActionInfo: struct (nullable = true)
| | | |-- promoIsClick: boolean (nullable = true)
| | | |-- promoIsView: boolean (nullable = true)
| | |-- publisher_infos: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- referer: string (nullable = true)
| | |-- social: struct (nullable = true)
| | | |-- hasSocialSourceReferral: string (nullable = true)
| | | |-- socialInteractionNetworkAction: string (nullable = true)
| | | |-- socialNetwork: string (nullable = true)
| | |-- time: string (nullable = true)
| | |-- transaction: struct (nullable = true)
| | | |-- affiliation: string (nullable = true)
| | | |-- currencyCode: string (nullable = true)
| | | |-- localTransactionRevenue: string (nullable = true)
| | | |-- localTransactionShipping: string (nullable = true)
| | | |-- localTransactionTax: string (nullable = true)
| | | |-- transactionId: string (nullable = true)
| | | |-- transactionRevenue: string (nullable = true)
| | | |-- transactionShipping: string (nullable = true)
| | | |-- transactionTax: string (nullable = true)
| | |-- type: string (nullable = true)
|-- socialEngagementType: string (nullable = true)
|-- totals: struct (nullable = true)
| |-- bounces: string (nullable = true)
| |-- hits: string (nullable = true)
| |-- newVisits: string (nullable = true)
| |-- pageviews: string (nullable = true)
| |-- sessionQualityDim: string (nullable = true)
| |-- timeOnSite: string (nullable = true)
| |-- totalTransactionRevenue: string (nullable = true)
| |-- transactionRevenue: string (nullable = true)
| |-- transactions: string (nullable = true)
| |-- visits: string (nullable = true)
|-- trafficSource: struct (nullable = true)
| |-- adContent: string (nullable = true)
| |-- adwordsClickInfo: struct (nullable = true)
| | |-- adNetworkType: string (nullable = true)
| | |-- criteriaParameters: string (nullable = true)
| | |-- gclId: string (nullable = true)
| | |-- isVideoAd: boolean (nullable = true)
| | |-- page: string (nullable = true)
| | |-- slot: string (nullable = true)
| |-- campaign: string (nullable = true)
| |-- isTrueDirect: boolean (nullable = true)
| |-- keyword: string (nullable = true)
| |-- medium: string (nullable = true)
| |-- referralPath: string (nullable = true)
| |-- source: string (nullable = true)
|-- visitId: string (nullable = true)
|-- visitNumber: string (nullable = true)
|-- visitStartTime: string (nullable = true)
Selecting a subset of the data
Given that the business use case, we want a dataset that looks like this:
root
|-- visit_date_dimension_identifier: string (nullable = true)
|-- visitnumber: long (nullable = true)
|-- visitid: long (nullable = true)
|-- visitstarttime: long (nullable = true)
|-- fullvisitorid: string (nullable = true)
|-- hitnumber: long (nullable = true)
|-- index: long (nullable = true)
|-- value: string (nullable = true)
|-- base_website_name: string (nullable = true)
|-- hostname: string (nullable = true)
|-- browser: string (nullable = true)
|-- devicecategory: string (nullable = true)
|-- operatingsystem: string (nullable = true)
|-- medium: string (nullable = true)
|-- action_type: string (nullable = true)
|-- pagepath: string (nullable = true)
|-- isentrance: boolean (nullable = true)
|-- isexit: boolean (nullable = true)
|-- channelgrouping: string (nullable = true)
|-- bounces: long (nullable = true)
|-- hitseventcat: string (nullable = true)
|-- hitseventaction: string (nullable = true)
|-- contentgroup2: string (nullable = true)
|-- hitseventlabel: string (nullable = true)
|-- transactionid: string (nullable = true)
|-- bounce_indicator: integer (nullable = true)
|-- total_hits: long (nullable = true)
|-- total_page_views: long (nullable = true)
|-- total_session_cart_amount: long (nullable = true)
|-- analytics_uid: string (nullable = true)
|-- web_geo_network_city: string (nullable = true)
|-- web_geo_network_metro_area: string (nullable = true)
|-- web_geo_network_country: string (nullable = true)
|-- web_geo_network_state: string (nullable = true)
|-- eventcategory: string (nullable = true)
|-- filename: string (nullable = true)
So let’s line these up in a table:
SNo | Original Column Name | New Column Name | Sources and Notes |
---|---|---|---|
1 | visit_data_dim_id | visit_date_dimension_identifier | Ensure that date type is right. Also may need to use a different uid |
2 | visitNumber | visit_number | This is the number of the visit. |
3 | visitId | visit_id | This is the unique identifier for the visit. |
4 | visitStartTime | visit_date_dimension_identifier | Ensure that date type is right. Also may need to use a different uid |
5 | fullVisitorId | fullvisitorid | This is the unique identifier for the visitor. |
6 | hits.hitNumber | hit_number | This is the number of the hit. |
7 | customDimensions.elements.index</mark | custom_dim_value | This is the value of the custom dimension. |
9 | hits.page.page | base_website_name | This is the page of the page. |
10 | hits.page.hostname | hostname | This is the hostname of the page. |
11 | device.browser | browser | This is the browser of the device. |
12 | device.deviceCategory | device_category | This is the device category. |
13 | device.operatingSystem | operating_system | This is the operating system of the device. |
14 | trafficSource.medium | medium | This is the medium of the traffic source. |
15 | hits.eCommerceAction.actionType | action_type | This is the action type of the ecommerce action. |
16 | hits.page.pagePath | pagepath | This is the page path of the page. |
17 | hits.isEntrance | isentrance | This is the entrance indicator of the hit. |
18 | hits.isExit | isexit | This is the exit indicator of the hit. |
19 | channelGrouping | channelgrouping | This is the channel grouping. |
20 | totals.bounces | bounces | This is the bounces. |
21 | hits.eventInfo.eventCategory | hitseventcat | This is the event category of the event info. |
22 | hits.eventInfo.eventAction | hitseventaction | This is the event action of the event info. |
23 | hits.contentGroup.contentGroup2 | contentgroup2 | This is the content group 2 of the content group. |
24 | hits.eventInfo.eventLabel | hitseventlabel | This is the event label of the event info. |
25 | hits.transaction.transactionId | transactionid | This is the transaction id of the transaction. |
26 | totals.bounces | bounce_indicator | This is the bounce indicator. |
27 | totals.hits | total_hits | This is the total hits. |
28 | totals.pageviews | total_page_views | This is the total page views. |
29 | totals.transactionRevenue | total_session_cart_amount | This is the total session cart amount. |
30 | customDimensions.elements.value | analytics_uid | This is the analytics uid. |
31 | geoNetwork.city | web_geo_network_city | This is the web geo network city. |
32 | geoNetwork.metro | web_geo_network_metro_area | This is the web geo network metro area. |
33 | geoNetwork.country | web_geo_network_country | This is the web geo network country. |
34 | geoNetwork.region | web_geo_network_state | This is the web geo network state. |
35 | hits.eventInfo.eventCategory | eventcategory | This is the event category. |
36 | filename | filename | This is the filename. |
JSON Manipulation
The first thing we need to do is to flatten the JSON. We can do this using the following code: