# Spare Cores Crawler
SC Crawler is a Python package to pull and standardize data on cloud compute resources, with tooling to help organize and update the collected data into databases.
## Database schemas
The database schemas and relationships are visualized and documented at https://dbdocs.io/spare-cores/sc-crawler.
## Usage
The package provides a CLI tool:
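For an overview of the available subcommands and options:

```sh
sc-crawler --help
```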
### Collect data
Note that you need specific credentials and permissions to run sc-crawler against the vendors below:
#### Amazon Web Services (AWS)
AWS supports several options for authentication and access when interacting with its APIs. This is usually an AWS access key stored in ~/.aws/credentials or in environment variables, or an attached IAM role.
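For example, when using environment variables (the standard variable names recognized by the AWS SDKs):

```sh
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
```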
The related user or role requires the below minimum IAM policy:
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCrawler",
            "Effect": "Allow",
            "Action": [
                "pricing:ListPriceLists",
                "pricing:GetPriceListFileUrl",
                "pricing:GetProducts",
                "ec2:DescribeRegions",
                "ec2:DescribeAvailabilityZones",
                "ec2:DescribeInstanceTypes",
                "ec2:DescribeSpotPriceHistory",
                "ec2:DescribeInstanceTypeOfferings"
            ],
            "Resource": "*"
        }
    ]
}
```
#### Google Cloud Platform (GCP)
GCP API access uses the Application Default Credentials. This is usually the path to a credential configuration file (created at https://developers.google.com/workspace/guides/create-credentials#service-account) stored in the GOOGLE_APPLICATION_CREDENTIALS environment variable, but it could also be an attached service account, Workload Identity Federation, etc.
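For example, when using a service account key file:

```sh
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"
```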
The related user or service account requires the below minimum roles:
- Commerce Price Management Viewer
- Compute Viewer
The related APIs also need to be enabled in the project.
#### Hetzner Cloud
Generate a token for your Hetzner Cloud project and store it in the HCLOUD_TOKEN environment variable.
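For example:

```sh
export HCLOUD_TOKEN="..."
```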
#### Microsoft Azure
Authentication is handled via the DefaultAzureCredential,
so you can use either secrets or certificates.
The following environment variables are required:
- `AZURE_CLIENT_ID` (application client ID)
- `AZURE_TENANT_ID`
To authenticate with a secret:
- `AZURE_CLIENT_SECRET` (secret value)
To authenticate with a certificate:
- `AZURE_CLIENT_CERTIFICATE_PATH`
- `AZURE_CLIENT_CERTIFICATE_PASSWORD` (optional)
For further options, consult the EnvironmentCredential docs.
Optionally, you can also specify the Subscription (otherwise the first one found in the account will be used):
- `AZURE_SUBSCRIPTION_ID`
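For example, when authenticating with a secret:

```sh
export AZURE_CLIENT_ID="..."
export AZURE_TENANT_ID="..."
export AZURE_CLIENT_SECRET="..."
```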
The related Service Principal requires either the global "Reader" role, or the following list of (more restrictive) permissions:
- `Microsoft.Resources/subscriptions/locations/read`
To create the Service Principal, go to App registrations, and then assign the role at the Subscription's Access control page.
#### UpCloud
Create a user (subaccount) in the UpCloud control panel with "API connections" permission enabled. Then configure the following environment variables:
- `UPCLOUD_USERNAME`
- `UPCLOUD_PASSWORD`
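Example:

```sh
export UPCLOUD_USERNAME="..."
export UPCLOUD_PASSWORD="..."
```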
#### OVHcloud
You need to create a cloud project, optionally enable all regions, then create and configure a service account as described in the Managing OVHcloud service accounts via the API document. In short:
1. Create a service account via a `POST` API query to `/me/api/oauth2/client`.
2. Get its `urn` via a `GET` to `/me/api/oauth2/client/{client_id}` (using the `client_id` from above).
3. Set minimum permissions for the `urn` via a `POST` to `/iam/policy`, e.g.:
```json
{
    "description": "Minimum permissions for sc-data",
    "identities": ["{your_urn}"],
    "name": "sc-data",
    "permissions": {
        "allow": [
            {
                "action": "account:apiovh:me/get"
            },
            {
                "action": "publicCloudProject:apiovh:get"
            },
            {
                "action": "publicCloudProject:apiovh:region/get"
            },
            {
                "action": "publicCloudProject:apiovh:flavor/get"
            }
        ]
    },
    "resources": [
        {
            "urn": "urn:v1:eu:resource:account:{your_account_id}-ovh"
        }
    ]
}
```
Then configure the following environment variables:
- `OVH_ENDPOINT` (e.g. "ovh-eu")
- `OVH_CLIENT_ID`
- `OVH_CLIENT_SECRET`
By default, the first project found in the account will be used.
Optionally, you can also specify the project ID to override that behavior
via the OVH_PROJECT_ID environment variable.
For the price catalog, we default to using the IE (Ireland) OVH subsidiary,
which can be overridden via the OVH_SUBSIDIARY environment variable.
This choice affects the currency used for prices.
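For example, with the EU endpoint:

```sh
export OVH_ENDPOINT="ovh-eu"
export OVH_CLIENT_ID="..."
export OVH_CLIENT_SECRET="..."
```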
Fetch and standardize datacenter, zone, server, traffic, storage, etc. data from AWS into a single SQLite file, as shown in the example run below.
If you need to run this many times, set the SC_CRAWLER_INSPECTOR_DATA_PATH
environment variable to a directory for caching the inspector data, so that it
won't be downloaded multiple times.
An up-to-date SQLite database is maintained by the Spare Cores team in the SC Data repository, and can also be downloaded from https://sc-data-public-40e9d310.s3.amazonaws.com/sc-data-all.db.bz2.
Example run, assuming the `pull` subcommand and an `--include-vendor` filter (check `sc-crawler --help` for the exact options):
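```sh
# pull AWS data into a local SQLite database
sc-crawler pull --connection-string sqlite:///sc-data-all.db --include-vendor aws
```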
### Hash data
Database content can be hashed via the sc-crawler hash command. It will provide
a single SHA1 hash value based on all records of all SC Crawler tables, which is
useful to track if database content has changed.
```console
$ sc-crawler hash --connection-string sqlite:///sc-data-all.db
b13b9b06cfb917b591851d18c824037914564418
```
For advanced usage, check sc_crawler.utils.hash_database to hash tables or rows.
### Copy and sync data
To copy data from a database to another one or sync data between two databases, you can use the copy and sync subcommands, which also support feeding SCD tables.
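For the exact flags of these subcommands, see their help pages:

```sh
sc-crawler copy --help
sc-crawler sync --help
```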
### Database migrations
To generate CREATE TABLE statements using the current version of the Crawler schemas, e.g. for a MySQL database (a sketch assuming a `--dialect` option; verify via the help page mentioned below):
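```sh
sc-crawler schemas create --dialect mysql
```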
See sc-crawler schemas create --help for all supported database engines
(mainly thanks to SQLAlchemy), and other options.
sc-crawler schemas also supports many other subcommands based on Alembic,
e.g. to upgrade or downgrade schemas in a database (or just print
the related SQL commands via the --sql flag), print the current version,
set a database version to a specific revision, or auto-generate migration
scripts (for SC Crawler developers).
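For example, to print the SQL statements needed to upgrade a database to the latest schema version (a sketch assuming `upgrade` accepts the same `--connection-string` option as `hash`):

```sh
sc-crawler schemas upgrade --connection-string sqlite:///sc-data-all.db --sql
```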
## ORM
SC Crawler uses SQLModel / SQLAlchemy as the ORM to interact with the underlying database, and you can also use the defined schemas and models to read and filter a previously pulled DB. Quick example:
```python
from sc_crawler.tables import Server
from sqlmodel import create_engine, Session, select

# create a connection (pool) to the SQLite database
engine = create_engine("sqlite:///sc-data-all.db")
# define an in-memory representation of the database for the ORM objects
session = Session(engine)
# query the database for the Server with the trn1.32xlarge id
server = session.exec(select(Server).where(Server.server_id == 'trn1.32xlarge')).one()

# use rich to pretty-print the objects
from rich import print as pp

pp(server)
# vendor is a Vendor relationship of the Server, in this case being aws
pp(server.vendor)
```