Spare Cores Crawler#

SC Crawler is a Python package to pull and standardize data on cloud compute resources, with tooling to help organize and update the collected data into databases.

Database schemas#

The database schemas and relationships are visualized and documented at https://dbdocs.io/spare-cores/sc-crawler.

Usage#

The package provides a CLI tool:

sc-crawler --help

Collect data#

Note that you need specific credentials and permissions to be able to run sc-crawler against the below vendors:

Amazon Web Services (AWS)

AWS supports different options for Authentication and access when interacting with its APIs. This is usually an AWS access key stored in ~/.aws/credentials or in environment variables, or an attached IAM role.
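
For example, a minimal sketch of providing an access key via environment variables (the values below are placeholders):

export AWS_ACCESS_KEY_ID=your-access-key-id          # placeholder
export AWS_SECRET_ACCESS_KEY=your-secret-access-key  # placeholder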

The related user or role requires the below minimum IAM policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCrawler",
            "Effect": "Allow",
            "Action": [
                "pricing:ListPriceLists",
                "pricing:GetPriceListFileUrl",
                "pricing:GetProducts",
                "ec2:DescribeRegions",
                "ec2:DescribeAvailabilityZones",
                "ec2:DescribeInstanceTypes",
                "ec2:DescribeSpotPriceHistory"
            ],
            "Resource": "*"
        }
    ]
}
Google Cloud Platform (GCP)

Application Default Credentials are used for interacting with the GCP APIs. This is usually the path to a credential configuration file (created at https://developers.google.com/workspace/guides/create-credentials#service-account) stored in the GOOGLE_APPLICATION_CREDENTIALS environment variable, but it could also be an attached service account, Workload Identity Federation etc.
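
For example, pointing the Application Default Credentials at a downloaded service account key file (the path is a placeholder):

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json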

The related user or service account requires the below minimum roles:

  • Commerce Price Management Viewer
  • Compute Viewer

List of APIs required to be enabled in the project:

Hetzner Cloud

Generate an API token in your Hetzner Cloud project and store it in the HCLOUD_TOKEN environment variable.
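
For example (the token value is a placeholder):

export HCLOUD_TOKEN=your-hetzner-api-token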

Fetch and standardize datacenter, zone, server, traffic, storage, etc. data from AWS into a single SQLite file:

sc-crawler pull --connection-string sqlite:///sc-data-all.db --include-vendor aws

An up-to-date SQLite database is maintained by the Spare Cores team in the SC Data repository, and it can also be downloaded from https://sc-data-public-40e9d310.s3.amazonaws.com/sc-data-all.db.bz2.
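
For example, one way to download and decompress the public dump (assuming curl and bzip2 are available):

curl -O https://sc-data-public-40e9d310.s3.amazonaws.com/sc-data-all.db.bz2
bunzip2 sc-data-all.db.bz2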

Hash data#

Database content can be hashed via the sc-crawler hash command. It provides a single SHA-1 hash value based on all records of all SC Crawler tables, which is useful for tracking whether the database content has changed.

$ sc-crawler hash --connection-string sqlite:///sc-data-all.db
b13b9b06cfb917b591851d18c824037914564418

For advanced usage, check sc_crawler.utils.hash_database to hash tables or rows.

Copy and sync data#

To copy data from one database to another, or to sync data between two databases, you can use the copy and sync subcommands, which also support feeding SCD (slowly changing dimension) tables.
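
Both subcommands document their options (e.g. the source and target connections) in the built-in help:

sc-crawler copy --help
sc-crawler sync --help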

Database migrations#

To generate CREATE TABLE statements using the current version of the Crawler schemas, e.g. for a MySQL database:

sc-crawler schemas create --dialect mysql

See sc-crawler schemas create --help for all supported database engines (mainly thanks to SQLAlchemy), and other options.

sc-crawler schemas also supports many other subcommands based on Alembic, e.g. upgrading or downgrading schemas in a database (optionally just printing the related SQL commands via the --sql flag), printing the current schema version, setting a database to a specific revision, or auto-generating migration scripts (for SC Crawler developers).
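
For example, a hedged sketch of printing the SQL statements needed to upgrade an existing database to the latest schema revision (the option names are assumptions; see sc-crawler schemas upgrade --help for the exact flags):

# option names below are assumptions, verify via --help
sc-crawler schemas upgrade --connection-string sqlite:///sc-data-all.db --sql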

ORM#

SC Crawler uses SQLModel / SQLAlchemy as the ORM to interact with the underlying database, and you can also use the defined schemas and models to read/filter a previously pulled database. Quick example:

from sc_crawler.tables import Server
from sqlmodel import create_engine, Session, select

# Create a connection (pool) to the SQLite database.
engine = create_engine("sqlite:///sc-data-all.db")
# Define an in-memory representation of the database for the ORM objects.
session = Session(engine)
# Query the database for the Server with the trn1.32xlarge id.
server = session.exec(select(Server).where(Server.server_id == 'trn1.32xlarge')).one()

# Use rich to pretty-print the objects.
from rich import print as pp
pp(server)
# The vendor is a Vendor relationship of the Server, in this case being aws.
pp(server.vendor)