A binary repository is a hub for development teams across the whole organization, centralizing the management of all the binary artifacts the organization generates and uses. The inevitable diversity of binary artifact types, and their differing positions in the overall workflow, are major reasons to use a dedicated binary repository manager rather than a simple file server. But more focused functionality means more decisions.
What a Binary Repository Needs to Store
Binary repositories store (1) files and (2) metadata, and for each of these, both (a) releases and (b) nightly builds (subject to retention policies). The 'files' set has a complicating subset: third-party artifacts, which need to be handled differently for combined legal and technical reasons.
In most cases it's fair to assume that files will not change and that deletions will happen only during the nightly builds.
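As a sketch, such a nightly retention pass over a snapshot repository might look like the following; the directory layout, the naming convention, and the keep-three policy are illustrative assumptions, not any specific tool's behavior:

```shell
# Simulate a snapshot repository backed by a plain directory (illustrative).
mkdir -p demo-snapshots
for v in 1.0 1.1 1.2 1.3 1.4; do
  echo "artifact-bytes" > "demo-snapshots/app-$v-SNAPSHOT.jar"
done

# Retention policy: keep the three newest snapshots. "Newest" is decided by
# version string here; a real policy would use build timestamps from the
# repository metadata.
ls demo-snapshots | sort -r | tail -n +4 | while read -r old; do
  rm "demo-snapshots/$old"
done

ls demo-snapshots
```

Release repositories would be excluded from such a pass entirely, matching the assumption above that releases are immutable.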
Common Types of Metadata
Selecting appropriate binary metadata can be tricky: many data points that source files carry implicitly (because code and metadata are both text) must be specified separately from the binary files themselves. Here are some common metadata types and their uses:
Metadata | Used for
Versions | Upgrading and downgrading automatically
Dependencies | Other artifacts the current artifact depends on
Dependents | Other artifacts that depend on the current artifact
Build date and time | Tracing an artifact back to the build that produced it
Documentation (e.g., Javadoc) | Contextual documentation in IDEs; offline availability
Quality metrics | Code coverage, rules compliance, test results
Custom metadata | Custom reports and processes
Managing Multiple Repositories
Even smaller organizations will probably need to host several binary repositories, each designated by project, department, and permissions. In this case a full-fledged repository manager, with permissions separate from the file system, will prove especially useful.
Typically, each project should have separate snapshot and release repositories, each with its own set of permissions, plus repositories for the external projects it uses.
Enabling Search
Source code is inherently searchable; binaries obviously are not, so you'll need to put some effort into enabling search in your binary repository. Here's what to index and why:
Metadata search is absolutely essential. Consider the types of metadata (and uses) listed above, and think of how useful a search feature would be for quickly locating dependencies, isolating problematic artifacts, and navigating a tangled web of cross-enterprise projects.
Metadata searches are especially crucial for collaborative development across teams – since one team may very easily be using a binary built by another team, and won't be able to peek through the veil of the other team's code.
Exposed search indexes, for tools like Eclipse plugins, are less essential, but can save hundreds of hours over several months. Most developers want to find an artifact and immediately add it as a dependency, without leaving the IDE.
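To make the value concrete, here is a toy metadata search; the colon-separated record format (group:artifact:version:license) is an assumption for illustration, standing in for a repository manager's real search index:

```shell
# A miniature "metadata index" (hypothetical format: group:artifact:version:license).
cat > metadata-index.txt <<'EOF'
com.example:app:2.4.1:Apache-2.0
com.example:core:1.9.0:Apache-2.0
org.thirdparty:lib:3.2.0:GPL-3.0
EOF

# Find every artifact carrying a GPL license, e.g. for a compliance report.
# A repository manager exposes the same kind of query through its UI and API.
grep ':GPL' metadata-index.txt
```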
Exposing Binaries with APIs
Binary repositories should allow users to upload and download artifacts, typically over HTTP or WebDAV.
Users should also be able to query the metadata with REST APIs. A solid metadata query API lets you do two things: (a) automate configuration, and (b) integrate the repository with other tools, like continuous integration servers. For example, a CI server will need to query the repository metadata to see whether a new version of an artifact is available.
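Such a polling check might look like the sketch below. The endpoint path and JSON shape are hypothetical (shown only in a comment), so a canned response stands in for the HTTP reply:

```shell
# A real CI server would issue something like (hypothetical endpoint):
#   curl -s "https://repo.example.com/api/search/latestVersion?g=com.example&a=app"
# Here a canned response of the assumed JSON shape stands in for the reply:
response='{"groupId":"com.example","artifactId":"app","latestVersion":"2.4.1"}'

# Extract the assumed "latestVersion" field and compare against the version
# the CI server last built with.
latest=$(printf '%s' "$response" | sed -n 's/.*"latestVersion":"\([^"]*\)".*/\1/p')
current="2.4.0"
if [ "$latest" != "$current" ]; then
  echo "new version available: $latest"
fi
```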
Caching External Binaries
At some point, in some (perhaps most) projects, you'll probably use third-party artifacts hosted in a repository external to your organization. Network latency and bandwidth will affect development speed directly – especially when your external artifacts are (gigantic) binaries – even if your team is fully on-premise. Now imagine you need to work every day with the latest build of several dependencies, each taking several minutes to download – possibly several times a day. Now consider a long chain of dependencies, and you're immediately (and with no payoff on the development side) in binary download dependency hell.
To skirt this time-sink, cache these files in your repository manager. The cached binaries can be served rapidly to other machines on the same network, after the initial request – either to human coders or directly to CI servers themselves.
Proxying External Binaries
Further, external dependencies introduce an element of unnecessary risk – simply because you can't control access to them. To remove this risk, configure your repo manager to proxy these files. Keep a copy in your private repository; then dependency availability will be up to you. You can also apply your own backup and availability policies, guaranteeing access to the artifacts even if they disappear on the upstream repository.
External repositories can be proxied either (a) on demand or (b) as full mirrors.
Option 1: on-demand repository
In an on-demand proxy scenario, requests to the remote repositories happen only the first time any developer requests an artifact that is not yet cached in the proxy repository. Any further requests from other developers will use the copy in the proxy repository.
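The flow can be sketched with plain directories, where remote-repo stands in for the upstream repository and proxy-cache for the proxy's local cache (both names are illustrative):

```shell
# Set up the stand-ins: an "upstream" repository and an empty proxy cache.
mkdir -p remote-repo proxy-cache
echo "artifact-bytes" > remote-repo/lib-1.0.jar

fetch() {
  if [ -f "proxy-cache/$1" ]; then
    echo "hit: served $1 from cache"
  else
    cp "remote-repo/$1" "proxy-cache/$1"   # only the first request goes upstream
    echo "miss: fetched $1 from remote"
  fi
}

fetch lib-1.0.jar    # first developer's request: cache miss, goes upstream
fetch lib-1.0.jar    # every later request: cache hit, served locally
```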
Besides typical HTTP proxy features, a repository manager adds features specific to binary repositories (backed by all the metadata and information stored in the repositories). Consider the advantages of, for example, filtering by group or artifact IDs or expiring unused artifacts to reduce the space needed.
Option 2: mirrored repository
In a mirrored repository scenario, all changes are automatically synchronized to the mirror. So even the first request for an artifact is always resolved from the repository that is closest to you.
The simplest way to implement a mirrored repository is to use rsync on the filesystem backing the repository:
More sophisticated mirroring can be accomplished using the repository's REST APIs.
To choose between on-demand and mirrored proxying, use this table:
 | On-demand | Mirrored
First request for an artifact is resolved from | The remote repository | The local mirror
Storage requirements | Low (only used artifacts are cached) | High (all artifacts are cached)
Bandwidth requirements (between proxy and remote) | Lower in principle (under low demand), but potentially higher if many developers fetch artifacts at once | Higher in principle (all artifacts are transferred), but potentially lower in practice if many developers fetch artifacts at once
Supporting Distributed Teams
When the teams that access the repositories are spread across different sites or around the globe, it is important to mirror your internal repositories as well as the third-party ones.
To do this, set up repository manager servers hierarchically: run a server in each location that (a) proxies the remote server and (b) synchronizes repository contents, either on demand or (preferably) as a mirror.
Use a master server as the single writable instance, and let all the other servers (distributed across locations) proxy the master for local caching.
Promoting Artifacts
When an artifact is pushed to a repository, that repository may not be its final destination. Imagine a workflow in which a release-candidate artifact must pass integration testing and QA; only artifacts that complete this process should be available to other teams or clients.
A repository manager can enforce this workflow by setting different permissions for each repository while letting only authorized users promote or move the artifacts between repositories.
For instance, push release candidates to a release-candidate repository, and set permissions so that only the QA team can move artifacts from there to the releases repository once their tests are done (either manually or automatically). Production systems can then be configured to pull artifacts only from the final releases repository, thus enforcing completion of the QA process.
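A minimal sketch of that promotion gate, using directories for the two repositories and a hard-coded role check standing in for the repository manager's real permission model:

```shell
# Stand-ins for the two repositories (names are illustrative).
mkdir -p repo-rc repo-releases
echo "artifact-bytes" > repo-rc/app-2.4.1.jar

promote() {  # move an artifact from release-candidate to releases; QA only
  role=$1; artifact=$2
  if [ "$role" != "qa" ]; then
    echo "denied: only QA may promote $artifact"
    return 1
  fi
  mv "repo-rc/$artifact" "repo-releases/$artifact"
  echo "promoted $artifact to releases"
}

promote dev app-2.4.1.jar || true   # a developer is refused
promote qa  app-2.4.1.jar           # the QA team promotes the artifact
```

Note that promotion is a move, not a copy: the artifact exists in exactly one repository at a time, so its current location always reflects its QA status.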