This explains indexes, indexing, index controls, and index requirements in SPK.
SPK supports generating a packages index for a repository to speed
up solves. An index must have been generated separately for that
repository before it can be used.
The index is designed to help the solvers with solves (package
look-ups, deprecation checks, pre-loading non-package variable
references, getting install requirements, etc.). It does not contain
the full package data. So it does not have the information needed to
help other SPK operations, e.g. building or testing a package.
If indexing is enabled, the index must be generated before trying to use it in a solve. They are not generated on the fly (outside of tiny repositories for automated tests).
If index use is enabled, but no index has been generated for a
repository, SPK will fallback to using the underlying repository’s
packages directly. This typically results in slower solves,
particularly with larger repositories.
If a solve uses multiple repositories and indexes exist for some or
all of them, SPK will use the indexes that exist.
SPK index use can be enabled or disabled in the SPK config file.
It is enabled by default but require index generation and index
updating to be set up, on a per repository basis, before it will have
an impact. Setting this up is recommended for repositories with a
large number of packages.
Most SPK commands can use the --index-use <value> option to override
repository index use on a per command basis. The spk repo index ...
command always has index use disabled because it operates on the
indexes themselves. spk info ... also disables indexes because
displaying complete package information requires reading the package
from the repository directly, not just the subset of data in an index.
If index use is enabled in the config file, it can be disabled with
the --index-use disabled command line option.
To generate an index for a repository (e.g. origin), you can run:
spk repo index -r origin
This will generate a flatbuffer schema based index file. The index
file is stored in the underlying spfs repo (e.g. origin repo) in a
index/spk/ sub-directory.
Index generation only works on one repository at a time. If you have
multiple repositories to index you have to run spk repo index ...
once for each repository.
See spk repo index -h for more details.
You can also set up a SPK Indexer, if you also have a SPK message channel configured. See the ‘Automatic index updates …’ sections below.
You do not have to generate a full index every time a package changes, you can update a package in an existing index, see the next section.
When index use is enabled, it is important to update the indexes when changes happen to the repository and its packages. Otherwise, the index and real package data will get out of sync. An index that is old will not have the same data as the repositories it is based on. This can lead to discrepancies in solves.
Updating a package in an existing index, such as after a new build is published or a package is deprecated, is not automatic.
A site using SPK has to set up a system to trigger index updates
when they want them to happen. The recommendation is after a new
package is published, deprecated, undeprecated, or deleted. But
regular periodic complete index generation may also work for a site,
depending on the frequency of package changes and the periodic updates.
To update an existing index, e.g. the origin repo’s index after a
new python package was published or deprecated, run:
spk repo index -r origin --update python
The --update option can be given multiple times to update several
packages at a once, e.g.:
spk repo index -r origin --update python --update zlib
The --update option can also take a package/version. This lets the
update be restricted to a specific version of a package. This can make
for shorter update times for packages with large numbers of versions,
or builds per version, e.g.:
spk repo index -r origin --update python/3.10.8 --update zlib/1.2.12
These commands will read in the existing index for the repository, and update the versions and builds of the named package in the index. This is faster than generating an index from scratch. But it has to be run once per repository to update the give packages in that repository’s index.
See spk repo index -h for more details.
You can also set up a SPK Indexer to update a repositories index when a package changes, if you also have a SPK message channel configured. See the ‘Automatic index updates …’ sections below.
SPK has support for automatically updating a repository’s index when a package is published, or modified (e.g. deprecated), if there is an external messaging channel configured for SPK to talk to.
SPK current only supports kafka as a messaging channel.
With a messaging channel configured, SPK will be able to send package update messages (to a topic/queue) when any of these spk command are used:
spk publishspk deprecatespk undeprecatespk removeThis allows an external monitoring system to see package updates and act on them, e.g. by triggering an index update.
SPK’s spk repo indexer --name ... command can be used to run such a
monitoring system that works with SPK’s configured messaging channels
(e.g. spk repo indexer --name my_indexer_config_name, see spk repo indexer -h for more options).
An SPK Indexer will run forever and listen to the package update messages for a single repository. It wil accumulate the package updates, and when there is a pause in messages, it will perform an index update on that repository’s index, and then wait for more package update messages.
While it is running, the SPK indexer will send heartbeat messages (to a topic/queue) to indicate it is alive. While it is updating an index, the SPK indexer will send index status update messages to indicate when an index update has started, is in-progress, and has completed.
The SPK commands that send package update messages will also wait and listen for index status update messages. They do this after they have finished publishing or updating the packages they were acting on, and only if a message channel is configured for their use.
Those SPK commands use the index status update messages to workout when that repository’s index update is complete, and then they return control to the user. This allows a site to ensure package updates are reflected in an index before a user’s next spk command is run. The indexer’s heartbeat and index update messages are used to detect if the indexer isn’t running so the commands do not wait forever.
See the SPK messaging docs and the SPK config file reference docs for more details.
The SPK Indexer is meant to be run continuously as a service to monitor one repository and update that repository’s index. If you have multiple repositories, and you want to automatically update all their indexes, you will need to config and run multiple SPK Indexers - each with their own config name and section.
It is up to each site to deploy the SPK Indexer(s) as a service in the most appropriate way for that site.
The SPK Indexer will make an index from scratch if one does not exist for the repository it is monitoring. This allows an Indexer to be used to bootstrap an index for a repository, and recover an index if it is removed or another problem occurs.
The special admin “generate a full index” package update message can be sent to make a SPK Indexer regenerate a repository’s index from scratch, even if one exists already.
See the SPK messaging docs and the SPK config file reference docs for more details.
These SPK commands send package update messages (to a topic/queue) when they run:
spk publishspk deprecatespk undeprecatespk removeThey also listen to index status messages (from a topic/queue), and wait for their repository’s index to update before the commands finish. Messages about other repositories’ indexes are ignored.
However, because these commands are typically run by users, the cannot wait forever. The first thing they do is check the latest message (in the topic/queue) and to see if it is recent enough, e.g. within the last minute or so. A recent enough index status message means that a SPK Indexer is active and sending messages. In this case the commands will continue to listen and wait for index updates. If that message is too old, the command assumes the indexer is down, that the index is not going to be updated soon, and it exits immediately rather than wait.
Provided they get recent enough messages, the commands will check the
index_start_time field of the messages. They are looking for times
that are after the command finished its updates (e.g. spk publish
has published all its packages). Those index updates will include the
package changes made by the command. When the command gets an ‘Index
completed’ message with a late enough index start time, it knows the
index contains its changes and it will stop listening and can exit
successfully.
SPK supports file-based flatbuffer formatted indexes.
An index is stored in a file inside the repository it indexes. Only
SPK filesystem repositories (based on spfs fs repositories) support
indexes. The file will be kept in the index/spk/ sub-directory of an
spfs fs repository.
An index is implemented by a set of types in the spk-storage and spk-schema crates that work together to wrap a repository with its index and produce indexed package objects that act as packages for the solver.
The hierarchy of index related types is:
in the 'spk-storage' crate:
RepositoryHandle enum
Indexed struct (implements Repository and Storage traits)
RepoIndex enum (implements RepoIndex and RepoIndexMut traits)
FlatBufferRepoIndex struct (implements RepoIndex and RepoIndexMut traits)
flatbuffers schema
in the 'spk-schema' crate:
IndexedPackage (implements Package traits, produced by a FlatBufferRepoIndex)
flatbuffers schema
SPK uses flatbuffers for the index data format on disk. This is fast
to read and use, but slow to generate. Updates to an existing index
require an index to be read in completely and written out again with
the updated data. It does not support updating complex structures in
an index in-place
The spk-proto crate contains SPK’s flatbuffers schema for an
index. It requires flatc to be installed to generate the rust code
for the index schema.
The schema is versioned internally, and the index file name contains the index schema version number as well. This is checked on load against the index version SPK is compatible with to schema and SPK version mismatches.
The index structure does not match the rust object structures one-to-one. This is partially due to only keeping things needed for solves and solver operations, and partially due to what flatbuffers require, support, and recommend for speed (lists not sets or mappings, removing duplication and intermediate structures).
The broad top-level schema structure is (capital letters indicate a schema table name):
RepositoryIndex
schema version number
list of PackageIndex
name
list of VersionIndex
version number
list of BuildIndex
build digest/id
... other fields needed for solving
list of GlobalVar (used to prime the 'resolvo' solver to avoid restarts)
name
list of valid values
The structure is quite deep to cover the data for options, requirements, compat rules, version numbers, build ids, and components. See the schema file itself for more detailed information, including which fields of rust objects are being ignored, omitted from an index.
The flatbuffers data format supports adding new fields and structures without breaking existing code, provided fields are not deleted. New fields should be added only to the ends of tables.
The index schema’s version number lets SPK check that the index data
is compatible with the current SPK being run. That is, that the code
understands this version of the schema. There will be situations where
multiple versions of SPK are being used in a site and they may not all
be able to use index data generated by other versions of SPK.
SPK includes the schema version number in the index file name to allow
multiple index schemas to co-exist during a transition between index
versions.
When a new field needs to be added to the index schema, the developer should:
SPK’s compatible schema version numberDuring the transition the older version of SPK will use the older index format(s) and the newer SPK will use the newer one. Both indexes will have to be generated and kept up to date during the transition. Once the older SPK version is fully retired, the older index data can be retired too.
When an older field does not need to be read anymore, the developer should:
There is no need to change the index schema version for to stop reading a field, and index generation remains unchanged (even though it includes data that is not read by SPK anymore).
If an older field no longer needs to be populated, the developer should:
Technically, the field will still be present in the index schema, and be set in index generation because it is needed in the index’s object constructors. But the None values will reduce the space the field takes up. And while this kind of change in backwards compatible, the empty field may be treated a default value by older versions of SPK when read from the index - thus the need to version up the index schema to ensure this does not cause unexpected behaviour when multiple versions of SPK are in use.
Removing an old field will break backwards compatibility of the flatbuffers format. It is not something that should be done lightly.
When a new field needs to be removed from the index schema, the developer should:
During the transition the older version of SPK will use the older index format(s) and the newer SPK will use the newer one. Both indexes will have to be generated and kept up to date during the transition. Once the older SPK version is fully retired, the older index data and its generation can be retired too.