At its Cloud Data Summit, Google today announced the preview launch of BigLake, a new data lake storage engine that makes it easier for enterprises to analyze the data in their data warehouses and data lakes.
The idea here, basically, is to take Google’s experience with running and managing its BigQuery data warehouse and extend it to data lakes on Google Cloud Storage, combining the best of data lakes and warehouses into a single service that abstracts away the underlying storage formats and systems.
This data, it should be noted, could reside in BigQuery or live on AWS S3 and Azure Data Lake Storage Gen2. With BigLake, developers will get access to a uniform storage engine and the ability to query the underlying data stores through a single system without the need to move or duplicate data.
“Managing data across disparate lakes and warehouses creates silos and increases risk and cost, especially when data needs to be moved,” writes Gerrit Kazmaier, VP and GM of databases, data analytics and business intelligence at Google Cloud, in today’s announcement. “BigLake allows companies to unify their data warehouses and lakes to analyze data without worrying about the underlying storage format or system, which eliminates the need to duplicate or move data from a source and reduces cost and inefficiencies.”
Using policy tags, BigLake allows admins to configure their security policies at the table, row and column level. This includes data stored in Google Cloud Storage, as well as the two supported third-party systems, where BigQuery Omni, Google’s multi-cloud analytics service, enables these security controls. Those security controls also ensure that only the right data flows into tools like Spark, Presto, Trino and TensorFlow. The service also integrates with Google’s Dataplex tool to provide additional data management capabilities.
Google notes that BigLake will provide fine-grained access controls and that its API will span Google Cloud, as well as file formats like the open column-oriented Apache Parquet and open-source processing engines like Apache Spark.
“The volume of valuable data that organizations have to manage and analyze is growing at an incredible rate,” Google Cloud software engineer Justin Levandoski and product manager Gaurav Saxena explain in today’s announcement. “This data is increasingly distributed across many locations, including data warehouses, data lakes, and NoSQL stores. As an organization’s data gets more complex and proliferates across disparate data environments, silos emerge, creating increased risk and cost, especially when that data needs to be moved. Our customers have made it clear; they need help.”
In addition to BigLake, Google also announced today that Spanner, its globally distributed SQL database, will soon get a new feature called “change streams.” With these, users can easily track any changes to a database in real time, be those inserts, updates or deletes. “This ensures customers always have access to the freshest data as they can easily replicate changes from Spanner to BigQuery for real-time analytics, trigger downstream application behavior using Pub/Sub, or store changes in Google Cloud Storage (GCS) for compliance,” explains Kazmaier.
Google Cloud also today brought Vertex AI Workbench, a tool for managing the entire lifecycle of a data science project, out of beta and into general availability, and launched Connected Sheets for Looker, as well as the ability to access Looker data models in its Data Studio BI tool.