Python Google Analytics API Over 1 mil Rows Unsampled Data + Pull Data From Multiple Profiles

How to pull large data sets with over 1 million rows from Google Analytics and avoid sampling?
How to pull the same data across multiple Google Analytics profiles?

google_analytics_python_api_1_million_rows
This post provides a solution for exporting more than 10,000 rows (the example pulls over 1 millions rows) and a solution for the sampling limitations of Google Analytics. The solution uses the Google Analytics reporting API and Python. It checks for the presence of sampling before running a query and gives you the ability to break your query down into smaller 10,000 row pieces. The pieces are multiple smaller queries with shorter date ranges. All the data from the small queries are stitched together and output into a single CSV file for the full date range. With this solution you can also pull data from multiple profiles using this single Python application. If you work across multiple Google Analytics profiles with high traffic volumes and often run into sampling, then this solution should save you lots of time. If you’d like to see how to use the Google Analytics Sheets Add-on to pull more than 10k rows of data and avoid sampling check out my previous post.

Python Google Analytics Reporting Setup

To get started you’ll need to get setup with Google Analytics API Python client library. If you have not worked with the Google Analytics API and Python before, go through the step by step Python Google Analytics tutorial from my previous post first to get up and running. No prior experience with Python or the Google Analytics API is required for you to get started. Once you have Python setup and have pulled your first query from the Google Analytics API you are ready to get started with this solution.

Python Google Analytics Solution to Export more than 10,000 Rows and Avoid Sampling

This Python Solution will allow you to pull more than 10000 row of Google Analytics data (over 1 million rows in this example) and will automatically combine the 10,000 row chunks of data and output the data in a CSV file. When pulling large amounts of data or trying to create ad-hoc nonstandard reports in Google Analytics you have likely run into sampling. This solution allows you to break your queries down into shorter date ranges to avoid sampling.

Download the Python File

1) Make sure to download or save the Python file from above in the analytics samples folder on your computer within the Google Analytics API Python Client Master folder. My path is C:\ryanpraski\Desktop\google-api-python-client-master\samples\analytics

2) Find profile_ids= {‘ryanpraski’; ‘1234567’} on line 162 of the file and enter your profile name and profile id. To find your Profile (View) ID login to your Google Analytics account and go to Admin > View Settings. The View ID will be listed first on the page.

3) Find date_ranges on line 165 and enter the date range you’d like for your query. For the first query just enter a single date range. The date range that is included in the script is 2015-09-01 to 2015-09-30.

4) Find the path variable on line 175. This is the file path or the folder location where the CSV data file will be saved.

5) Find the filename variable on line 176. This is the name of the CSV file that the data will be output to. Note that %s is a placeholder variable and the profile name that you specified in step 1 will be written here when the file is created. For example my filename will be google_analytics_data_ryanpraski_1.csv.

***Make sure to change the file name every time you run this Python script. If you do not change the filename, each time you run the Python script the CSV file will be overwritten.***

6) Make sure to save the changes you made to the Python file on your computer.

7) Find the google_analytics_api_v3_10krows_nosampling_multiple_profiles_ryanpraski.py file on your computer and right click to edit with IDLE. Click on Run Module.

8) The sample query I ran had just over 120k total results. You’ll see messages like the screenshot below that shows you the progress of your queries and what remains to still run. You will see the profile name, date range, and approximate total rows of data (rounded to the nearest 10,000) for that date range in the Python Shell. As each query runs you’ll see an update in the Python Shell in real time. When the first query runs, in the Python Shell, you’ll see 10000 of about 100000 rows and when the second query runs you’ll see 20000 of about 100000 rows and so on. If you’ve chosen multiple date ranges you’ll see messages in the Shell for each date range and for each 10,000 row query. When you see the message “All profiles done” printed to the console your queries are complete and you can open your CSV file.

google_analytics_python_10kof120k_rows

For this query there were just over 120000 rows of data in the CSV file. See the screenshow below with 122648 rows.

google_analytics_python_120k_rows_data

9) If you have sampling in your query you will get a message telling you “Error: Query contains sampled data!” and your query will not run.

google_analytics_python_error_sampled_data

10) You will need to go back in and edit the Python file using your text editor. Find the date_ranges variable again on line 165. You now will want to add multiple shorter date ranges. I would recommend first trying to limit the date range to monthly chunks and then go down to daily if you still have sampling.

To add multiple date ranges make sure to remove the comments # and update your date ranges like the example below:

date_ranges = [('2015-08-01',
               '2015-08-31'),
               ('2015-09-01',
               '2015-09-30'),
               ('2015-10-01',
               '2015-10-31'),
               ('2015-11-01',
                '2015-11-30'),
               ('2015-12-01',
               '2015-12-31')]

If your query still has sampling with monthly date ranges you should try breaking the date ranges down to weekly or even daily. Once you’ve chosen a date range granularity with no sampling the query will run. If you have a lot of data it may take some time for all the queries to run.

The screenshot below shows what is printed to the Python Shell for the first two date ranges August and September for a query with multiple date ranges. In this case the query has over 1 million rows in total.

google_analytics_python_api_1_million_rows_multiple_date_ranges

When I open the csv files with the full Google Analytics data set from this query in Excel, Excel 2010 hits the 1,048,576 row limit shown in the screenshot below.

google_analytics_python_api_1_million_rows_multiple_date_ranges_excel

To find out the total number of rows in my data set I can use Python to give me a count of rows using the code below. Make sure to replace my file path with the file path to your csv file.

row_count = sum(1 for row in csv.reader( open('C:\\Users\\ryanpraski\\Documents\\Google Analytics Python Data\\google_analytics_data_ryanpraski_1.csv') ) );print row_count

Run the Same Google Analytics Report Automatically Across Multiple Profiles

This solution also allows you to query multiple Google Analytics profiles in a single run. This can be a massive time saver if you have to run the same report and need the same data across different websites or mobile applications The data set for each profile is stored in a separate csv file. To get data for multiple profiles (views) follow the steps below:

Find profile_ids= {‘ryanpraski’; ‘1234567’} on line 162 of the Python file and comment it out. Just above this, starting on line 154 you can enter multiple profile ids. Just make sure to enter the profile name and the profile id. Recall that he profile name will be used in the csv file name where the data from the query will be stored. See the code sample below for what needs to be updated. Make sure to save and run the updated Python file.

profile_ids = profile_ids = {'My Profile 1':  '1234567',
                             'My Profile 2':  '1234568',
			     'My Profile 3':  '1234569',
                             'My Profile 4':  '1234561'}

# Uncomment this line & replace with 'profile name': 'id' to query a single profile
# Delete or comment out this line to loop over multiple profiles.

## profile_ids = {'ryanpraski':  '1234567'}

If you have any questions please let me know in the comments or on Twitter @ryanpraski.