SOS API Reliability: How Cobalt Handles State Portal Outages

Executive Summary: Planning for state portal outages starts with one uncomfortable fact: there is no single Secretary of State system. There are more than 50 of them, each built, hosted, and maintained on its own schedule. New York, California, and Delaware run separate portals with separate search interfaces, separate uptime, and separate maintenance windows.^[1]^[2]^[3] An API layer that sits in front of those portals cannot be more reliable than the sources on a given day, but it can absorb individual portal failures with caching, retries, async delivery, and graceful degradation so that one slow state does not stall an underwriting queue.^[4] This guide is written for engineering and risk leaders running a build-versus-buy evaluation. It explains why state portals fail individually and how a lender should design integration logic for partial coverage. Cobalt is treated honestly throughout: it is a data source, not a decisioning engine; it does not guarantee any state's uptime; and its cached data can be stale.

Why is planning for state portal outages a heterogeneity problem?

What makes 50 portals different from one API?

A Secretary of State portal is a government web application, not a commercial data feed. Each state owns its own technology stack, refresh cadence, and downtime calendar. New York routes corporation and UCC lookups through its own division systems.^[1] California exposes business entity records through BizFile Online and lists processing dates that shift over time.^[2] Delaware runs a separate name-search interface and charges a state fee for certain status checks.^[3] None of them coordinate maintenance with each other.

The failure domain is therefore per-state, not global. An outage in Oregon has no bearing on Texas. A reliability strategy that treats the data source as one system will be wrong, because the real system is dozens of independent systems with independent failure modes.

Why do individual portals go down?

Portals fail for ordinary operational reasons, not exotic ones. The common causes are predictable enough to design around.

• Scheduled maintenance. State systems take planned downtime, often overnight or on weekends, with little public notice.

• Rate limits and blocking. Portals built for occasional human use push back on automated request volume.

• Backend latency. Older state infrastructure can take far longer to return a record, with wide variance between states.

• Markup and form changes. A redesigned search page can break field parsing until the integration is updated.

• Partial data availability. Officer or ownership fields are present in some states and absent in others, so a successful call can still return less than a complete record.

The practical takeaway is that "the API is down" is rarely the right diagnosis. More often, one upstream state is unavailable while the rest are fine.
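
One way to make that diagnosis mechanical is to track recent outcomes per state rather than globally. The sketch below is illustrative only: `StateHealth`, its window size, and its failure threshold are assumptions made for this example, not part of the Cobalt API.

```python
from collections import defaultdict, deque


class StateHealth:
    """Tracks recent live-lookup outcomes per state (illustrative helper)."""

    def __init__(self, window: int = 20, failure_threshold: float = 0.5):
        self.failure_threshold = failure_threshold
        # One bounded history per state: True = success, False = failure.
        self._history = defaultdict(lambda: deque(maxlen=window))

    def record(self, state: str, ok: bool) -> None:
        self._history[state].append(ok)

    def is_degraded(self, state: str) -> bool:
        history = self._history.get(state)
        if not history:
            return False  # no evidence yet, so assume the state is healthy
        failures = sum(1 for ok in history if not ok)
        return failures / len(history) >= self.failure_threshold

    def degraded_states(self) -> list:
        return sorted(s for s in self._history if self.is_degraded(s))


health = StateHealth(window=4)
for ok in (False, False, True, False):
    health.record("oregon", ok)
for ok in (True, True, True):
    health.record("texas", ok)
print(health.degraded_states())
```

If only one or two states show up as degraded, the problem is upstream at those portals; if nearly every state does, the API layer or the caller's own network is the more likely suspect.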

How should an API layer handle a single state being unavailable?

What does live versus cached actually trade off?

Cobalt exposes a `liveData` parameter on the Secretary of State search. With `liveData=true`, the request pulls in real time directly from the state source and can take 10 to 180 seconds depending on the state. With `liveData=false`, the request returns from Cobalt's monthly-refreshed cache in under a second. That is a direct reliability lever: cached reads do not depend on the state portal being up at request time, but they can be stale.

Mode	Speed	Freshness	Dependency on live portal
`liveData=true`	10 to 180 seconds	Real time	Yes, fails if portal is down
`liveData=false`	Under 1 second	Monthly refresh	No, served from cache

Neither mode is universally correct. Final verification before funding wants `liveData=true`; high-volume pre-screening can tolerate `liveData=false`. When a state portal is having a bad day, cached reads keep a pipeline moving, as long as the integration records that the data was cached and how old it is.

How does a waterfall pattern reduce outage exposure?

The documented pattern is a waterfall: check the cache first with `liveData=false`, and escalate to a live lookup only when there is no match or the cached record looks stale. This keeps most requests fast and off the live portal, which lowers cost and shrinks the window where a portal outage can hurt you.

curl --location 'https://apigateway.cobaltintelligence.com/v1/search?searchQuery=Acme%20Corp&state=delaware&liveData=true&screenshot=true' \
--header 'x-api-key: YOUR_API_KEY' \
--header 'Accept: application/json'

Flip `liveData` to `false` for the cache-first leg of the waterfall, then retry with `liveData=true` only when the cached answer is missing or older than your policy allows.
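
The waterfall can be sketched in a few lines. This is a minimal sketch under stated assumptions: `fetch(live)` is a hypothetical caller-supplied function that issues the search with `liveData=true` or `liveData=false`, and `Lookup.fetched_at` assumes the caller can determine when the returned data was retrieved. Neither is a documented Cobalt interface.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


@dataclass
class Lookup:
    record: Optional[dict]   # parsed response body, or None on a miss
    fetched_at: datetime     # when the underlying data was retrieved (assumed known)


@dataclass
class WaterfallResult:
    record: Optional[dict]
    source: str              # "cache" or "live"
    age: timedelta           # how old the data is at decision time


def waterfall(fetch: Callable[[bool], Lookup],
              max_age: timedelta,
              now: Optional[datetime] = None) -> WaterfallResult:
    """Cache first; escalate to a live lookup only on a miss or a stale record."""
    now = now or datetime.now(timezone.utc)
    cached = fetch(False)                      # liveData=false leg
    age = now - cached.fetched_at
    if cached.record is not None and age <= max_age:
        return WaterfallResult(cached.record, "cache", age)
    live = fetch(True)                         # liveData=true leg
    return WaterfallResult(live.record, "live", now - live.fetched_at)
```

Returning the source and age alongside the record means the caller never has to reconstruct later whether a decision rested on cached or live data.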

What retry and backoff behavior belongs in the integration?

Why are SOS reads safe to retry?

The Secretary of State search is an HTTP GET, which is idempotent: making the same request several times has the same effect on the server as making it once, so a client can safely retry when there is doubt about whether a request completed.^[7] That property makes automated retry on a transient portal error sound rather than reckless, because a retried read does not create duplicate state.

Cobalt returns standard status codes that map cleanly to retry decisions. A 429 means the caller is rate limited and should slow down. A 500 means a server error, and the caller should retry with exponential backoff. A 404 may simply mean the business was not found in that state, which is a data outcome rather than a transient failure, so it should not be retried blindly.
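
That mapping can be written down as a small classifier. The `Action` names are invented for this sketch, and treating every 5xx code (not only 500) as retryable is an assumption made here rather than documented behavior.

```python
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"          # use the response
    NOT_FOUND = "not_found"    # data outcome: no match in this state
    SLOW_DOWN = "slow_down"    # rate limited: back off before retrying
    RETRY = "retry"            # server error: retry with exponential backoff
    FAIL = "fail"              # anything else: do not retry automatically


def classify(status_code: int) -> Action:
    if 200 <= status_code < 300:
        return Action.ACCEPT
    if status_code == 404:
        return Action.NOT_FOUND
    if status_code == 429:
        return Action.SLOW_DOWN
    if status_code >= 500:     # assumption: all 5xx are treated as transient
        return Action.RETRY
    return Action.FAIL
```

Keeping `NOT_FOUND` distinct from `FAIL` lets downstream rules record "searched, no match in this state" instead of lumping it in with integration errors.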

How do you retry without amplifying an outage?

Retries are useful and also dangerous. Naive retry loops turn a struggling backend into a worse one by piling on load, the textbook cause of cascading failure.^[5] The fix is bounded, jittered retries plus a retry budget so the client never spends more than a small fraction of its traffic on retries.^[4]

• Exponential backoff with jitter. Space retries out and randomize the delay so callers do not synchronize into a thundering herd.

• Retry budget. Cap retries as a percentage of total requests so a wide outage cannot multiply your own load.

• Honor Retry-After. When a 429 or 503 includes a Retry-After header, wait that long instead of guessing.

• Cap attempts. A small fixed ceiling, such as three attempts, prevents an infinite loop on a hard-down state.

• Fail to cache. When live retries are exhausted, fall back to `liveData=false` and label the result as cached.

The goal of retry logic is not to win every request. It is to ride out a transient portal failure without becoming the second cause of the outage.
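
The bullets above can be combined into one small loop. This is a sketch, not a reference client: `send()` is a hypothetical callable that performs one request and returns `(status_code, retry_after_seconds_or_None, body)`, the 10% budget ratio is an arbitrary example value, and `Retry-After` is assumed to have been parsed into seconds already (the header can also carry an HTTP date).

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0, retry_after=None, rng=None):
    """Seconds to wait before the next try; `attempt` is 1-based.

    Honors a server-provided Retry-After when present; otherwise draws
    uniformly from [0, min(cap, base * 2**(attempt - 1))] ("full jitter").
    """
    if retry_after is not None:
        return max(0.0, float(retry_after))
    ceiling = min(cap, base * 2 ** (attempt - 1))
    return (rng or random).uniform(0.0, ceiling)


class RetryBudget:
    """Permits retries only while they stay under `ratio` of requests seen."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_spend(self):
        if self.retries + 1 > self.ratio * self.requests:
            return False       # budget exhausted: do not add load
        self.retries += 1
        return True


def call_with_retries(send, budget, max_attempts=3, sleep=time.sleep):
    """Returns the last (status, body); retries only on 429 and 5xx."""
    budget.record_request()
    for attempt in range(1, max_attempts + 1):
        status, retry_after, body = send()
        if status != 429 and status < 500:
            return status, body                # success or non-retryable outcome
        if attempt == max_attempts or not budget.try_spend():
            break                              # attempt cap or budget reached
        sleep(backoff_delay(attempt, retry_after=retry_after))
    return status, body                        # caller may now fall back to cache
```

When the function returns a 429 or 5xx status, the caller is expected to take the "fail to cache" path: repeat the search with `liveData=false` and label the result as cached.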

How do async callbacks keep slow states from blocking a queue?

When should a request be asynchronous instead of synchronous?

Some live lookups exceed a normal request timeout. Oregon, for example, can take several minutes for a live result, so a synchronous call that blocks a worker is the wrong shape. Cobalt supports two async patterns: a `retryId` that the client polls for completion, or a `callbackUrl` that receives the finished result by POST when it is ready.

This matters for queue health. If a single slow state holds a synchronous worker for three minutes, throughput collapses under load. Async delivery lets the integration acknowledge the request immediately and process the result when it lands, so one slow portal does not back up the whole queue.

What does the polling versus callback choice look like?

Polling with a `retryId` suits batch jobs and environments without a public endpoint: the client submits the search, receives a `retryId`, and checks back until the status is complete. Callbacks suit event-driven systems: the client passes a `callbackUrl` and Cobalt POSTs the result there when the lookup finishes, removing the polling loop.
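
A polling loop needs a deadline so a stuck lookup cannot hold a worker forever. In this sketch, `check()` is a hypothetical function that re-requests the search with the stored `retryId` and returns the finished result, or `None` while the lookup is still pending; the timeout and interval values are placeholders.

```python
import time
from typing import Callable, Optional


def poll_until_done(check: Callable[[], Optional[dict]],
                    timeout_s: float = 300.0,
                    interval_s: float = 5.0,
                    clock: Callable[[], float] = time.monotonic,
                    sleep: Callable[[float], None] = time.sleep) -> Optional[dict]:
    """Polls until `check()` returns a result or the deadline would be exceeded."""
    deadline = clock() + timeout_s
    while True:
        result = check()
        if result is not None:
            return result
        if clock() + interval_s > deadline:
            return None        # timed out: the caller routes the file to manual review
        sleep(interval_s)
```

Injecting `clock` and `sleep` keeps the loop testable without real waiting.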

• Polling. Simple to operate, no inbound endpoint required, costs periodic status checks.

• Callback. No polling loop, lower latency to result, requires a reachable and authenticated receiver.

• Idempotent receivers. Build the callback handler to tolerate duplicate or retried deliveries without double-processing.

• Timeout policy. Decide how long a pending async lookup stays open before the file routes to manual review.
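
An idempotent receiver can be as simple as remembering which deliveries have already been handled. The sketch assumes each callback can be keyed by some stable identifier, for instance the `retryId` if the payload carries one; that key, and the in-memory `set`, are assumptions for illustration. A production handler would use a durable store.

```python
class CallbackReceiver:
    """Processes each delivery at most once, keyed by a delivery identifier."""

    def __init__(self, process):
        self._process = process   # caller-supplied handler for a finished result
        self._seen = set()        # illustrative; use a durable store in production

    def handle(self, delivery_id: str, payload: dict) -> bool:
        """Returns True if the payload was processed, False if it was a duplicate."""
        if delivery_id in self._seen:
            return False
        self._process(payload)
        # Mark as seen only after processing succeeds, so a delivery that
        # raised an exception can be retried by the sender.
        self._seen.add(delivery_id)
        return True
```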

How should a lender design for partial coverage and honest limits?

What does the API not promise?

Cobalt monitors state sources and returns a clear status code when a state is temporarily unavailable. It also publishes a 99.9% availability target for its own API layer, but it does not and cannot guarantee any individual state portal's uptime. Cached data is refreshed monthly, so `liveData=false` results can lag reality. Officer and ownership fields vary by state because the API returns what the state makes available, not a normalized superset. A 404 can mean "not found in this state," not "does not exist anywhere."

Cobalt is a data source, not a decisioning engine. It returns status, filing dates, officers where available, and screenshots for the audit file. The lender owns the rules that decide what that data means. Designing as if the API will make the credit decision is a category error.

How should partial results route in the lending stack?

Graceful degradation is the design principle: when full fresh data is not available, return a smaller or older answer rather than nothing, and make the downgrade explicit.^[4] A resilient lending integration plans for failure recovery as a first-class path, not an afterthought.^[6]

Failure mode	Mitigation	Lender impact
State portal down for live lookup	Fall back to `liveData=false` cache, label as cached	Pipeline continues, reviewer sees data age
Slow state exceeds timeout	Async `retryId` or `callbackUrl`	Queue does not block on one state
Rate limited (429)	Backoff plus retry budget, honor Retry-After	Throughput preserved, no self-inflicted outage
Stale cached record	Escalate to live lookup per freshness policy	Final decisions use fresh data
State not fully supported or partial fields	Route to manual review, never silent clear	No hidden coverage gap in the file

The wrong pattern is treating a cached, partial, or unavailable result as a clean pass. The right pattern records the data source, freshness, and any degradation in the underwriting file so a future reviewer can see exactly what was known at decision time.
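
A routing function makes the "never silent clear" rule explicit. The sketch below is one possible shape, not a prescribed policy: the `SosResult` fields and the three route names are invented for this example, and it assumes any escalation to a live lookup has already been attempted upstream.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Optional


@dataclass
class SosResult:
    found: bool
    source: str                                   # "live" or "cache"
    age: timedelta
    missing_fields: list = field(default_factory=list)
    state_supported: bool = True


def route(result: Optional[SosResult], max_age: timedelta) -> tuple:
    """Returns (route, reasons). Only a fresh, complete, live result is a clean pass."""
    if result is None:
        return "manual_review", ["no result: portal unavailable and no cache"]
    reasons = []
    if not result.state_supported:
        reasons.append("state not fully supported")
    if not result.found:
        reasons.append("not found in this state")
    if result.source == "cache" and result.age > max_age:
        reasons.append(f"cached data exceeds freshness policy ({result.age.days} days old)")
    if result.missing_fields:
        reasons.append("missing fields: " + ", ".join(result.missing_fields))
    if reasons:
        return "manual_review", reasons
    if result.source == "cache":
        # Usable, but the downgrade stays visible in the file.
        return "proceed_flagged", [f"served from cache ({result.age.days} days old)"]
    return "proceed", []
```

The reasons list is what belongs in the underwriting file: it states why a result was downgraded in terms a reviewer can check later.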

What should engineering build for the first production version?

What belongs in the minimum reliable integration?

Keep the first version small and explicit. The reliability comes from how failures are handled, not from extra features.

1. Cache-first waterfall: try `liveData=false`, escalate to `liveData=true` on miss or staleness.

2. Bounded retries with jitter and a retry budget on 429 and 500 responses.

3. Async delivery via `retryId` or `callbackUrl` for slow states.

4. Explicit handling for unsupported states and partial fields, routed to manual review.

5. Store the raw response, the data source, the freshness timestamp, and the decision reason.
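
Item 5 is easiest to honor when the stored record has a fixed shape from day one. The field names below are assumptions chosen for this sketch, and the `raw_response` content is a made-up placeholder rather than a real Cobalt payload.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class AuditRecord:
    state: str
    search_query: str
    source: str             # "live" or "cache"
    data_as_of: str         # ISO 8601 freshness timestamp
    decision_reason: str
    raw_response: dict      # stored unmodified for the audit file


def to_json_line(record: AuditRecord) -> str:
    """Serializes one record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


record = AuditRecord(
    state="delaware",
    search_query="Acme Corp",
    source="cache",
    data_as_of=datetime(2024, 5, 1, tzinfo=timezone.utc).isoformat(),
    decision_reason="cached within freshness policy",
    raw_response={"placeholder": "unmodified API response goes here"},
)
print(to_json_line(record))
```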

How should partial coverage be surfaced to operations?

Operations should never have to guess why a record looks thin. The exception queue should show whether the result was live or cached, how old it is, which fields the state did not provide, and which fallback path fired. That visibility turns a degraded result into a defensible decision instead of a silent gap.

How does this fit a complete verification stack?

What runs alongside SOS data?

Reliability planning extends beyond one endpoint. Entity verification through the Secretary of State search confirms the business exists and its status.^[8] UCC discovery surfaces secured-party claims on collateral.^[9] Each source has its own coverage and its own failure behavior, so the same degradation discipline applies across the stack.

Where does the CLI help operational reliability?

For teams that want consistent retry and error handling without writing an HTTP client, the Cobalt CLI wraps the same API with retry-ID persistence on disk, a stable JSON envelope across success and error paths, and deterministic exit codes.^[10] That makes failure handling repeatable in scripts and CI jobs, where a non-zero exit and a stable error shape matter more than raw speed.

References
