Spare Cores Crawler
SC Crawler is a Python package to pull and standardize data on cloud compute resources, with tooling to help organize and update the collected data into databases.
Database schemas
The database schemas and relationships are visualized and documented at https://dbdocs.io/spare-cores/sc-crawler.
Usage
The package provides a CLI tool:
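For an overview of the available subcommands and options, start with the built-in help:

$ sc-crawler --help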
Collect data
Note that you need specific credentials and permissions to be able to run sc-crawler with the below vendors:
Amazon Web Services (AWS)
AWS supports several options for authentication and access when interacting with its APIs: usually an AWS access key stored in ~/.aws/credentials or in environment variables, or an attached IAM role.
The related user or role requires the below minimum IAM policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCrawler",
            "Effect": "Allow",
            "Action": [
                "pricing:ListPriceLists",
                "pricing:GetPriceListFileUrl",
                "pricing:GetProducts",
                "ec2:DescribeRegions",
                "ec2:DescribeAvailabilityZones",
                "ec2:DescribeInstanceTypes",
                "ec2:DescribeSpotPriceHistory"
            ],
            "Resource": "*"
        }
    ]
}
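As a sketch, the policy above could be created and attached to a crawler user with the AWS CLI (the policy name, file name, account id, and user name below are placeholders):

$ aws iam create-policy --policy-name sc-crawler-read --policy-document file://sc-crawler-policy.json
$ aws iam attach-user-policy --user-name sc-crawler --policy-arn arn:aws:iam::123456789012:policy/sc-crawler-read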
Google Cloud Platform (GCP)
GCP uses Application Default Credentials for interacting with its APIs. This is usually the path to a credential configuration file (created at https://developers.google.com/workspace/guides/create-credentials#service-account) stored in the GOOGLE_APPLICATION_CREDENTIALS environment variable, but it could also be an attached service account, Workload Identity Federation etc.
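For example, when pointing to a service account key file (the path below is a placeholder):

$ export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json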
The related user or service account requires the below minimum roles:
- Commerce Price Management Viewer
- Compute Viewer
List of APIs required to be enabled in the project:
Hetzner Cloud
Generate a token in your Hetzner Cloud project and store it in the HCLOUD_TOKEN environment variable.
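For example (the token value below is a placeholder):

$ export HCLOUD_TOKEN=<your-hetzner-api-token>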
Microsoft Azure
Authentication is handled via the DefaultAzureCredential, so you can use either secrets or certificates.
The following environment variables are required:
- AZURE_CLIENT_ID
- AZURE_TENANT_ID

To authenticate with a secret:
- AZURE_CLIENT_SECRET

To authenticate with a certificate:
- AZURE_CLIENT_CERTIFICATE_PATH
- AZURE_CLIENT_CERTIFICATE_PASSWORD (optional)
For further options, consult the EnvironmentCredential docs.
Optionally, you can also specify the Subscription (otherwise the first one found in the account will be used):
- AZURE_SUBSCRIPTION_ID
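As a sketch, secret-based authentication could be configured like this (all values below are placeholders):

$ export AZURE_CLIENT_ID=<app-registration-client-id>
$ export AZURE_TENANT_ID=<tenant-id>
$ export AZURE_CLIENT_SECRET=<client-secret>
$ export AZURE_SUBSCRIPTION_ID=<subscription-id>  # optional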
The related Service Principal requires either the global "Reader" role, or the following (more restrictive) permissions:
- Microsoft.Resources/subscriptions/locations/read
To create the Service Principal, go to App registrations, and then assign the role at the Subscription's Access control page.
Fetch and standardize datacenter, zone, server, traffic, storage, etc. data from AWS into a single SQLite file:
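For example (the subcommand options shown here are assumptions; confirm them via sc-crawler --help):

$ sc-crawler pull --connection-string sqlite:///sc-data-all.db --include-vendor aws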
Such an up-to-date SQLite database is maintained by the Spare Cores team in the SC Data repository, and you can also find it at https://sc-data-public-40e9d310.s3.amazonaws.com/sc-data-all.db.bz2.
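For example, to download and decompress the published dump:

$ curl -O https://sc-data-public-40e9d310.s3.amazonaws.com/sc-data-all.db.bz2
$ bunzip2 sc-data-all.db.bz2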
Hash data
Database content can be hashed via the sc-crawler hash command. It provides a single SHA-1 hash value based on all records of all SC Crawler tables, which is useful for tracking whether the database content has changed.
$ sc-crawler hash --connection-string sqlite:///sc-data-all.db
b13b9b06cfb917b591851d18c824037914564418
For advanced usage, check sc_crawler.utils.hash_database to hash tables or rows.
Copy and sync data
To copy data from one database to another, or to sync data between two databases, you can use the copy and sync subcommands, which also support feeding SCD tables.
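Both subcommands document their options in the built-in help:

$ sc-crawler copy --help
$ sc-crawler sync --help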
Database migrations
To generate CREATE TABLE statements using the current version of the Crawler schemas, e.g. for a MySQL database:
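A sketch of such a call (the flag name for selecting the database engine is an assumption; see the --help reference below for the actual options):

$ sc-crawler schemas create --dialect mysql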
See sc-crawler schemas create --help for all supported database engines (mainly thanks to SQLAlchemy), and other options.
sc-crawler schemas also supports many other subcommands based on Alembic, e.g. upgrading or downgrading schemas in a database (optionally only printing the related SQL commands via the --sql flag), printing the current version, setting a database version to a specific revision, or auto-generating migration scripts (for SC Crawler developers).
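As a sketch, printing the SQL statements of a schema upgrade instead of executing them (assuming the --connection-string option works as for the hash subcommand):

$ sc-crawler schemas upgrade --connection-string sqlite:///sc-data-all.db --sql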
ORM
SC Crawler uses SQLModel / SQLAlchemy as the ORM to interact with the underlying database, and you can also use the defined schemas and models to read/filter a previously pulled DB. Quick examples:
from sc_crawler.tables import Server
from sqlmodel import create_engine, Session, select
engine = create_engine("sqlite:///sc-data-all.db") # (1)!
session = Session(engine) # (2)!
server = session.exec(select(Server).where(Server.server_id == 'trn1.32xlarge')).one() # (3)!
from rich import print as pp # (4)!
pp(server)
pp(server.vendor) # (5)!
1. Creating a connection (pool) to the SQLite database.
2. Define an in-memory representation of the database for the ORM objects.
3. Query the database for the Server with the trn1.32xlarge id.
4. Use rich to pretty-print the objects.
5. The vendor is a Vendor relationship of the Server, in this case being aws.
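Building on the session created above, you can also iterate over filtered results; a minimal sketch listing a few server ids:

for server in session.exec(select(Server).limit(5)):  # reuses the session and select from the example above
    pp(server.server_id)  # print only the server id of each record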