Using the pandas.DataFrame.to_gbq method in Python

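Pandas is a tool built on NumPy and created for data analysis tasks. It incorporates a number of libraries and standard data models, provides the tools needed to work efficiently with large datasets, and offers many functions and methods for processing data quickly and conveniently. It is one of the main reasons Python is a powerful and efficient environment for data analysis. This article describes how to use the pandas.DataFrame.to_gbq method.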

DataFrame.to_gbq(destination_table, project_id=None, chunksize=None, reauth=False, if_exists='fail', auth_local_webserver=False, table_schema=None, location=None, progress_bar=True, credentials=None)

Write a DataFrame to a Google BigQuery table.

This function requires the pandas-gbq package.

For authentication instructions, see the How to authenticate with Google BigQuery guide.

Parameters:

destination_table : str

Name of the table to write to, in the form dataset.tablename.

project_id : str, optional

Google BigQuery account project ID.

Optional when it is available from the environment.
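
As a rough illustration of the two parameters above, the sketch below checks that a table name has the dataset.tablename form before handing it to to_gbq. The helper names split_destination and upload are hypothetical and are not part of pandas or pandas-gbq.

```python
def split_destination(destination_table):
    """Split 'dataset.tablename' into its two parts (hypothetical pre-check)."""
    dataset, sep, table = destination_table.partition(".")
    if not sep or not dataset or not table or "." in table:
        raise ValueError(
            "destination_table must look like 'dataset.tablename', got %r"
            % destination_table
        )
    return dataset, table


def upload(df, destination_table, project_id=None):
    """Validate the table name, then delegate to DataFrame.to_gbq."""
    split_destination(destination_table)
    # project_id may stay None when the environment already supplies a project.
    df.to_gbq(destination_table, project_id=project_id)
```

Calling upload(df, 'TestDataSet.TestTable') relies on the environment for the project, while passing project_id names it explicitly. Running either call needs pandas-gbq and valid Google credentials.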

chunksize : int, optional

Number of rows to insert in each chunk from the DataFrame.

Set to None to load the whole DataFrame at once.
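
To make the effect of chunksize concrete, here is a small sketch of the row arithmetic. It only illustrates how a row count divides into chunks and is not how pandas-gbq is implemented internally; chunk_bounds is a hypothetical helper.

```python
def chunk_bounds(n_rows, chunksize=None):
    """Return the (start, stop) row ranges that a given chunksize implies."""
    if chunksize is None:
        # None means a single load covering every row.
        return [(0, n_rows)]
    if chunksize <= 0:
        raise ValueError("chunksize must be a positive integer or None")
    return [
        (start, min(start + chunksize, n_rows))
        for start in range(0, n_rows, chunksize)
    ]
```

For a DataFrame df, each pair could select one chunk with df.iloc[start:stop]. With 25 rows and chunksize=10 this gives three chunks, the last holding 5 rows.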

reauth : bool, default False

Force Google BigQuery to re-authenticate the user.

This is useful if multiple accounts are used.

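if_exists : str, default 'fail'

Behavior when the destination table exists. The value can be one of:

'fail':

If the table exists, raise pandas_gbq.gbq.TableCreationError.

'replace':

If the table exists, drop it, recreate it, and insert the data.

'append':

If the table exists, insert the data. Create the table if it does not exist.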
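
The three if_exists modes can be summarized as a small decision function. This is a sketch of the documented behavior only, not pandas-gbq code; plan_write is a hypothetical name, and the branch for a missing table under 'fail' and 'replace' is an assumption, since the text above spells it out only for 'append'.

```python
def plan_write(table_exists, if_exists="fail"):
    """Return the steps implied by the documented if_exists behavior."""
    if if_exists not in ("fail", "replace", "append"):
        raise ValueError("%r is not a valid value for if_exists" % if_exists)
    if not table_exists:
        # Assumption: with no existing table, every mode creates it and inserts.
        return ["create", "insert"]
    if if_exists == "fail":
        return ["raise TableCreationError"]
    if if_exists == "replace":
        return ["drop", "create", "insert"]
    return ["insert"]  # 'append' on an existing table
```

In practice, 'append' suits repeated loads into the same table, while 'replace' discards whatever the table held before.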

auth_local_webserver : bool, default False

Use the local webserver flow instead of the console flow when getting user credentials.

New in version 0.2.0 of pandas-gbq.

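table_schema : list of dicts, optional

List of BigQuery table fields to which the DataFrame columns conform, for example [{'name': 'col1', 'type': 'STRING'},...].

If no schema is provided, it is generated from the dtypes of the DataFrame columns.

See the BigQuery API documentation for the available field names.

New in version 0.3.1 of pandas-gbq.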
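
A table_schema list is easy to build from a plain mapping of column names to BigQuery types. The sketch below assumes that mapping is maintained by hand; make_table_schema and upload_with_schema are hypothetical helpers, not part of pandas or pandas-gbq.

```python
def make_table_schema(column_types):
    """Turn {'column': 'BIGQUERY_TYPE'} into the list-of-dicts form."""
    return [
        {"name": name, "type": bq_type}
        for name, bq_type in column_types.items()
    ]


def upload_with_schema(df, destination_table, column_types, project_id=None):
    """Pass an explicit schema instead of letting it be inferred from dtypes."""
    df.to_gbq(
        destination_table,
        project_id=project_id,
        table_schema=make_table_schema(column_types),
    )
```

For instance, make_table_schema({'col1': 'STRING', 'col2': 'INTEGER'}) yields one dictionary per column in the same order as the mapping. An explicit schema is mainly useful when the inferred type would differ from the one the table should have.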

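location : str, optional

Location where the load job should run.

See the BigQuery locations documentation for a list of available locations.

The location must match that of the target dataset.

New in version 0.5.0 of pandas-gbq.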

progress_bar : bool, default True

Use the tqdm library to show a progress bar for the upload, chunk by chunk.

New in version 0.5.0 of pandas-gbq.

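credentials : google.auth.credentials.Credentials, optional

Credentials for accessing Google APIs. Use this parameter to override the default credentials, for example to use Compute Engine google.auth.compute_engine.Credentials or service account google.oauth2.service_account.Credentials directly.

New in version 0.8.0 of pandas-gbq.

New in version 0.24.0 of pandas.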
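
One way to supply explicit credentials is to load a service account key file and pass the result to to_gbq. This is a minimal sketch that assumes the google-auth package is installed and that a JSON key file exists; both function names are hypothetical.

```python
def load_service_account_credentials(key_path):
    """Build service account credentials from a JSON key file."""
    # Imported lazily so this module still loads where google-auth is absent.
    from google.oauth2 import service_account

    return service_account.Credentials.from_service_account_file(key_path)


def upload_with_credentials(df, destination_table, project_id, credentials):
    """Write df using explicit credentials instead of the default ones."""
    df.to_gbq(destination_table, project_id=project_id, credentials=credentials)
```

A typical call would first run creds = load_service_account_credentials('path/to/key.json'), where the path is a placeholder, and then pass creds as the last argument of upload_with_credentials.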

Example:

from datalab.context import Context
import datalab.storage as storage
import datalab.bigquery as bq
import pandas as pd
import time

# DataFrame to write: 100,001 rows with columns a, b, c
my_data = [[1, 2, 3]]
for i in range(0, 100000):
    my_data.append([1, 2, 3])
not_so_simple_dataframe = pd.DataFrame(data=my_data, columns=['a', 'b', 'c'])

# Alternative 1: write with DataFrame.to_gbq in chunks of 10,000 rows
start = time.time()
not_so_simple_dataframe.to_gbq('TestDataSet.TestTable',
                               Context.default().project_id,
                               chunksize=10000,
                               if_exists='append')
end = time.time()
print("time alternative 1 " + str(end - start))

# Alternative 3: write through Google Cloud Storage with the Datalab API
start = time.time()
sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'

# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)

# Define the BigQuery table (assumes the dataset TestDataSet already exists)
table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_dataframe(not_so_simple_dataframe)
table.create(schema=table_schema, overwrite=True)

# Write the DataFrame to GCS (Google Cloud Storage); %storage is a Datalab notebook magic
%storage write --variable not_so_simple_dataframe --object $sample_bucket_object

# Write the DataFrame to a BigQuery table
table.insert_data(not_so_simple_dataframe)
end = time.time()
print("time alternative 3 " + str(end - start))