What makes Google Cloud stand out is the software Google has built to manage the massive global pool of machines. Although the hardware Google uses is custom-designed, it is nothing special—simply hardware constructed to minimize the total cost of ownership. Given the sheer number of components, failures of machines and disks are frequent. The software introduced next manages hardware failures so that any problems are abstracted away.
Borg
Whether you need to run serverless code for 100 ms, need a virtual machine that runs indefinitely, or consume compute through a Google Cloud managed service, Borg manages the compute. When you make a request, a job is sent to Borg, which finds suitable machines on which to run the job's tasks. Borg monitors each task, and if a task malfunctions, Borg restarts it or moves it to a different machine. Borg inspired Kubernetes, so the concepts may be familiar.
However, as a user of Google Cloud, you never see Borg itself; you simply benefit from its abstraction. You request, directly or indirectly, the resources you require in terms of CPU cores and RAM. Borg fulfills everyone’s requests behind the scenes, making the best use of the machines available and seamlessly working around any failures.
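The placement and recovery behavior described above can be sketched in a few lines of Python. This is a toy illustration only, not Borg's actual algorithm: the `Machine` and `Task` classes, the `place` and `handle_failure` functions, and the first-fit strategy are all assumptions made for the example.

```python
from dataclasses import dataclass, field


# Hypothetical types for illustration; Borg's real data model is not shown in the text.
@dataclass
class Machine:
    name: str
    cpu_free: float
    ram_free_gb: float
    healthy: bool = True
    tasks: list = field(default_factory=list)


@dataclass
class Task:
    name: str
    cpu: float
    ram_gb: float


def place(task, machines):
    """Put the task on the first healthy machine with enough free CPU and RAM."""
    for m in machines:
        if m.healthy and m.cpu_free >= task.cpu and m.ram_free_gb >= task.ram_gb:
            m.cpu_free -= task.cpu
            m.ram_free_gb -= task.ram_gb
            m.tasks.append(task)
            return m
    return None  # no capacity; a real scheduler would queue or preempt


def handle_failure(failed, machines):
    """Mark a machine as failed and try to place its tasks elsewhere."""
    failed.healthy = False
    orphans, failed.tasks = failed.tasks, []
    return {t.name: place(t, machines) for t in orphans}


machines = [Machine("m1", cpu_free=4, ram_free_gb=16),
            Machine("m2", cpu_free=8, ram_free_gb=32)]
first = place(Task("web", cpu=2, ram_gb=4), machines)
second = place(Task("batch", cpu=4, ram_gb=16), machines)
moved = handle_failure(machines[0], machines)
print(first.name, second.name, moved["web"].name)
```

The caller only states how many cores and how much RAM a task needs; which machine runs it, and where it goes after a failure, stays hidden behind `place`. That is the abstraction the section describes, reduced to its simplest form.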
Colossus
While machines have local disks, these are only used for temporary storage. For managing permanent storage, Google uses a system named Colossus.
Storage pools consist of spinning disks and flash disks of different capacities. Again, these are selected to minimize their total cost of ownership, so they can and will fail. Colossus sets out to work around any failures and to fill the disks as fully as possible, as any empty space is wasted capacity.
Colossus constantly rebalances where data is stored. Frequently accessed (hot) data is stored on the more expensive fast disks. Less frequently accessed (cold) data is stored on slower, cheaper disks. As with compute, this means the details of storage are abstracted away. You, or more accurately a service acting on your behalf, request the number of bytes and the performance, in input/output operations per second (IOPS), that you require, and Colossus takes care of the rest.
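The hot and cold split can be illustrated with a small Python sketch. It assumes a deliberately simple policy, ranking chunks by read count and giving the fast tier to the top few, which is not necessarily how Colossus decides. The `rebalance` function, the `fast_slots` parameter, and the tier labels are hypothetical.

```python
from collections import Counter


def rebalance(access_counts, fast_slots):
    """Assign the most-read chunks to flash and everything else to spinning disk.

    Illustrative policy only; ties are broken by chunk name for determinism.
    """
    ranked = sorted(access_counts, key=lambda c: (-access_counts[c], c))
    hot = set(ranked[:fast_slots])
    return {c: ("flash" if c in hot else "spinning") for c in access_counts}


# Each entry is one read of a chunk; "a" is read three times, "b" twice.
reads = Counter(["a", "b", "a", "c", "a", "b", "d"])
placement = rebalance(reads, fast_slots=2)
print(placement)
```

Rerunning `rebalance` as the read counts change moves chunks between tiers, which mirrors the constant rebalancing described above without the caller ever choosing a disk.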
Spanner
Colossus also forms the foundation of two core database abstractions that support petabytes of data. Bigtable is a petabyte-scale NoSQL database that is eventually consistent across regions. Spanner is a SQL-compatible database that offers strong consistency across regions. Google uses Spanner internally for many systems. For example, Google Photos stores metadata for 4 trillion photos and videos from a billion users in Spanner. Some public references are available in a Spanner blog. These abstractions let you focus on the type of consistency and the level of availability you need rather than on the details of how they are achieved.
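The difference between the two consistency models can be made concrete with a toy Python model. This is a sketch under strong simplifying assumptions and does not reflect how Bigtable or Spanner are implemented: the `ReplicatedStore` class, its `strong` flag, and the explicit `replicate` step are all hypothetical.

```python
class ReplicatedStore:
    """Toy key-value store with one replica per region (illustration only)."""

    def __init__(self, regions, strong):
        self.replicas = {r: {} for r in regions}
        self.strong = strong
        self.pending = []  # (region, key, value) updates not yet applied

    def write(self, region, key, value):
        self.replicas[region][key] = value
        for other in self.replicas:
            if other == region:
                continue
            if self.strong:
                # Strong: every replica is updated before the write returns.
                self.replicas[other][key] = value
            else:
                # Eventual: other replicas catch up later.
                self.pending.append((other, key, value))

    def replicate(self):
        """Apply all outstanding updates to the lagging replicas."""
        for region, key, value in self.pending:
            self.replicas[region][key] = value
        self.pending.clear()

    def read(self, region, key):
        return self.replicas[region].get(key)


eventual = ReplicatedStore(["us", "eu"], strong=False)
eventual.write("us", "photo:1", "v1")
stale = eventual.read("eu", "photo:1")   # replica has not caught up yet
eventual.replicate()
fresh = eventual.read("eu", "photo:1")

strong = ReplicatedStore(["us", "eu"], strong=True)
strong.write("us", "photo:1", "v1")
immediate = strong.read("eu", "photo:1")
print(stale, fresh, immediate)
```

In the eventual case, a read in another region can briefly miss a recent write. In the strong case it cannot, at the cost of coordinating every write across regions. That trade-off is the decision the abstractions leave to you.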