Fixing “OpenAI Error: OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 104”

Few errors are as unsettling as a cryptic, low-level failure message popping up in the middle of a perfectly ordinary API call. If you’ve seen “OpenAI Error: OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 104,” you’re dealing with a connection that fell apart at the transport layer while TLS was in play. This isn’t a business-logic error and it’s not necessarily a bug in your code. It’s a symptom: somewhere between your application and OpenAI’s servers, the socket was torn down unexpectedly. The good news is that it’s solvable—and understanding what’s happening will save you hours of guesswork.

What This Error Actually Means

Let’s decode the message:

  • OpenSSL: The cryptographic library handling TLS/SSL on your side.
  • SSL_read: OpenSSL was reading bytes from the TLS session when the problem occurred.
  • SSL_ERROR_SYSCALL: A system-level I/O error surfaced during the read. This usually means the underlying TCP connection broke in a way that wasn’t a clean TLS shutdown.
  • errno 104: On Linux, this is ECONNRESET—“Connection reset by peer.” The remote side (or a proxy/NAT in the path) sent a TCP RST or the connection was otherwise reset.

In other words, TLS didn’t get a graceful “close notify.” The socket was just yanked out from under the read operation. That can happen for many reasons: idle timeouts on a proxy, a network hiccup, a misconfigured load balancer, HTTP/2 framing issues with a middlebox, or your own code aborting or stalling a streaming read.
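
If you want to confirm the errno mapping on your own machine, a quick Python check makes it explicit (purely illustrative; it only prints the constant and its message):

import errno, os

# errno 104 is ECONNRESET on Linux; print both the number and its text
print(errno.ECONNRESET)               # typically 104 on Linux
print(os.strerror(errno.ECONNRESET))  # "Connection reset by peer"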

Why It Shows Up in OpenAI Integrations

OpenAI’s APIs run over HTTPS, often using streaming (server-sent events or chunked responses) for tokens as they are generated. That streaming nature surfaces weaknesses in networks and proxies: anything that buffers, inspects, or times out long-lived connections is a candidate for a reset. You’ll see this error more when:

  • You stream responses for many seconds or minutes.
  • Your client or proxy uses aggressive idle/keep-alive timeouts.
  • You’re behind a corporate proxy, VPN, or TLS inspection device.
  • You upload large files slowly (e.g., on weak connections).
  • You reuse TCP connections heavily under high concurrency.

Quick First-Aid Checklist

If you need a fast triage path, start here:

  1. Retry the request with exponential backoff and jitter. Many resets are transient.
  2. If you’re streaming, test the same call without streaming to see if it stabilizes.
  3. Force HTTP/1.1 (disabling HTTP/2) and test again. Some proxies mishandle HTTP/2.
  4. Increase read timeouts (but keep them finite) and ensure you continuously consume the response stream.
  5. Upgrade your OpenAI SDK and TLS stack (OpenSSL, certifi/CA bundle, runtime).
  6. Bypass proxies/VPN temporarily. Try a different network (mobile hotspot) for a control test.
  7. Log at the transport level (verbose TLS/HTTP logging) to pinpoint where it breaks.

Common Root Causes and How to Fix Them

1) Idle or Inactivity Timeouts (Proxies, Load Balancers, Firewalls)

Most network intermediaries have idle or request-duration timeouts. Streaming responses that pause briefly, large uploads on slow links, or keep-alive connections can hit these limits. When a proxy closes a connection without a proper TLS shutdown, your client sees SSL_ERROR_SYSCALL with ECONNRESET.

Symptoms:

  • Resets after a predictable number of seconds (e.g., ~60 or ~120 seconds).
  • Stable for quick responses but flaky for long streams or big uploads.

Fixes:

  • Reduce the time each request spends idle. For streaming, consume data continuously. Avoid blocking your reader while doing CPU-heavy work—offload processing to a worker and keep reading.
  • Tune timeouts: increase proxy idle/request timeouts (e.g., reverse proxies, ingress controllers). Typical knobs include nginx proxy_read_timeout, keepalive_timeout; HAProxy timeout client/server/http-keep-alive; various cloud load balancer idle timeouts.
  • Use shorter-lived connections: disable connection pooling for sensitive paths or lower keep-alive lifetimes to avoid stale connections.
  • Split large uploads into smaller chunks if feasible and ensure consistent upload throughput.
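
As a concrete illustration of the shorter-lived-connections fix, here is a minimal httpx sketch; the pool sizes, keep-alive expiry, and timeouts are placeholders to adapt to your own path, not recommendations:

import httpx

# Keep pooled connections from outliving proxy/NAT idle timeouts.
# A keepalive_expiry below the most aggressive idle timeout in the path means
# the client discards a connection before a middlebox can reset it mid-read.
limits = httpx.Limits(
    max_connections=20,
    max_keepalive_connections=10,
    keepalive_expiry=30.0,  # seconds; assumes the strictest idle timeout is >= 60s
)
client = httpx.Client(
    limits=limits,
    timeout=httpx.Timeout(connect=10.0, read=120.0, write=60.0, pool=30.0),
)

Aligning keepalive_expiry below the shortest idle timeout in the path is what prevents the next read from landing on a half-dead connection.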

2) TLS Inspection or Middleboxes Dropping Long-Lived Streams

Some corporate security devices intercept TLS (installing a corporate root CA and doing a man-in-the-middle). Others don’t intercept but still buffer or terminate long-lived connections. Either can cause RSTs during streaming or under high throughput.

Symptoms:

  • Works off-network (home or hotspot), fails on corporate network.
  • Certificate chain looks different on the problematic network, or TLS handshake logs show a non-OpenAI issuer if intercepted.

Fixes:

  • Whitelist the OpenAI domains; disable TLS interception for those hosts.
  • Use a direct CONNECT tunnel for HTTPS without bumping, if policy allows.
  • Coordinate with network/security teams to align timeouts and buffering policy with streaming needs.
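
One quick way to check whether interception is in play is to inspect the certificate issuer your client actually receives. A minimal Python sketch using the standard library (run it on and off the suspect network and compare):

import socket, ssl

# Print the issuer of the certificate actually served to this machine.
# On an intercepted network the issuer is typically a corporate CA
# rather than a public certificate authority.
ctx = ssl.create_default_context()
with socket.create_connection(("api.openai.com", 443), timeout=10) as sock:
    with ctx.wrap_socket(sock, server_hostname="api.openai.com") as tls:
        cert = tls.getpeercert()
        print("issuer: ", cert.get("issuer"))
        print("subject:", cert.get("subject"))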

3) HTTP/2 Quirks and Proxies

Some intermediaries or older client libraries mishandle HTTP/2 flow control, headers, or long-lived streams. The result can be a stream abruptly reset by a middlebox.

Fixes:

  • Force HTTP/1.1 and test: in curl use --http1.1; in Node, undici/fetch speaks HTTP/1.1 unless you explicitly opt into HTTP/2, so check any custom Agent or proxy configuration that forces it; in Python httpx, pass http2=False (the default unless you enabled HTTP/2).
  • Upgrade your runtime and TLS stack to current versions with mature HTTP/2 support.

4) CA Bundle or TLS Stack Mismatch

If the OS or runtime uses an outdated CA bundle, the TLS handshake can become fragile, and renegotiation or SNI quirks may cause odd failures that surface as resets during reads.

Fixes:

  • Update OS certificates (e.g., ca-certificates package) and your language CA stores. In Python, keep certifi current. In Node, keep Node.js updated so its trust store is current.
  • Ensure SNI is enabled (it is by default in modern clients) so the correct certificate is served.
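
If you suspect a stale system bundle, one low-risk check is to point the client explicitly at an up-to-date certifi bundle and retest. A minimal httpx sketch (assumes certifi is installed and current; YOUR_API_KEY is a placeholder):

import certifi
import httpx

# Pin verification to certifi's CA bundle instead of whatever the OS
# or an older environment happens to provide.
client = httpx.Client(verify=certifi.where())
r = client.get("https://api.openai.com/v1/models",
               headers={"Authorization": "Bearer YOUR_API_KEY"})
print(r.status_code)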

5) Connection Reuse and Pooling Edge Cases

Reusable connections stuck behind NATs or rotated through proxies can become invalid without the client noticing until the next read. The read then fails with ECONNRESET.

Fixes:

  • Proactively limit max connection age or requests-per-connection in your HTTP client.
  • On failures, drop the pool and recreate the client. Ensure your retry policy avoids reusing obviously stale connections.
  • Keep concurrency balanced; don’t saturate a tiny pool with heavy streams.
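
The drop-the-pool fix above can be as simple as closing the client on a reset and building a fresh one before a single retry. A minimal httpx sketch, not a prescription:

import httpx

class PooledCaller:
    """Drops and rebuilds the connection pool when a read fails mid-request."""

    def __init__(self):
        self._client = self._new_client()

    def _new_client(self):
        return httpx.Client(
            timeout=httpx.Timeout(connect=10.0, read=120.0, write=60.0, pool=30.0)
        )

    def get(self, url, headers=None):
        try:
            return self._client.get(url, headers=headers)
        except httpx.ReadError:
            # The pool may hold stale connections; rebuild it and retry once.
            self._client.close()
            self._client = self._new_client()
            return self._client.get(url, headers=headers)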

6) Not Consuming Streaming Responses Properly

With server-sent events or chunked responses, you must read data continuously. If your code pauses reads, buffers may fill and a proxy or the server may reset the connection.

Fixes:

  • Drain the stream as data arrives. In Node, wire backpressure correctly; in Python, iterate the async generator promptly.
  • Avoid blocking operations inline with the read loop; dispatch work to separate tasks and keep reading.
  • Handle aborts cleanly. If you cancel, send a proper abort signal rather than terminating the socket abruptly.

7) Large Uploads on Slow Links

Uploading large files (e.g., for batch or fine-tuning workflows) over a slow or unstable connection can exceed upstream timeouts and trigger resets.

Fixes:

  • Use resumable or chunked upload workflows where available.
  • Increase client-side connect/write timeouts to accommodate throughput.
  • Ensure the path (proxies, firewalls) isn’t imposing tiny body size or duration caps.
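
For the timeout fix above, the relevant knob in an httpx-based client is the write timeout. A minimal sketch; the 600-second value is an assumption to size against your slowest expected upload:

import httpx

# Generous write timeout for slow uplinks; connect stays short so a dead
# network still fails fast instead of hanging.
upload_timeout = httpx.Timeout(connect=10.0, read=120.0, write=600.0, pool=30.0)
client = httpx.Client(timeout=upload_timeout)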

8) MTU and TCP-Level Problems

VPNs or tunnels that reduce the path MTU can cause fragmentation or blackholing. In rare cases, sustained issues manifest as resets under load.

Fixes:

  • Test path MTU with ping “do not fragment” flags and adjust interface MTU or enable MSS clamping on tunnels.
  • Try the same code off-VPN to see if the issue disappears.

Minimal Reproduction Tests

cURL: Baseline Connectivity and Protocol Toggle

Try a simple request with verbose logging. Replace the URL with the specific endpoint you use and provide a valid API key via header or environment variable.

curl --verbose --http1.1 https://api.openai.com/v1/models \
  -H "Authorization: Bearer YOUR_API_KEY"

If this succeeds but your app fails, investigate client configuration. If this fails on one network and succeeds on another, you likely have a network or proxy issue. Also try forcing HTTP/2:

curl --verbose --http2 https://api.openai.com/v1/models \
  -H "Authorization: Bearer YOUR_API_KEY"

Python: httpx With Timeouts and No HTTP/2

import httpx

headers = {"Authorization": "Bearer YOUR_API_KEY"}
timeout = httpx.Timeout(connect=10.0, read=60.0, write=30.0, pool=30.0)

with httpx.Client(http2=False, timeout=timeout) as client:
    r = client.get("https://api.openai.com/v1/models", headers=headers)
    print(r.status_code, len(r.text))

If this is stable but your streaming code isn’t, the problem may be in how the stream is consumed.

Node: fetch/undici With Abort and Logging

import { fetch } from "undici";

const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), 120000);

const res = await fetch("https://api.openai.com/v1/models", {
  method: "GET",
  headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
  signal: controller.signal
});

clearTimeout(timeout);
console.log(res.status, (await res.text()).length);

If your streaming handler uses ReadableStream, ensure you consume it promptly and handle backpressure.

Disable Proxies for a Control Test

Environment variables like HTTP_PROXY, HTTPS_PROXY, and NO_PROXY can silently route your traffic. For a fair test, unset them or configure NO_PROXY to include the OpenAI hosts. On Unix shells:

unset HTTP_PROXY HTTPS_PROXY http_proxy https_proxy
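
In Python, httpx honors these variables by default; for a proxy-free control test you can also disable environment lookup entirely. A minimal sketch:

import httpx

# trust_env=False ignores HTTP_PROXY/HTTPS_PROXY/NO_PROXY so the request
# goes direct, giving a clean comparison against the proxied path.
with httpx.Client(trust_env=False, timeout=30.0) as client:
    r = client.get("https://api.openai.com/v1/models",
                   headers={"Authorization": "Bearer YOUR_API_KEY"})
    print(r.status_code)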

Observability and Diagnostics

Turn On Verbose HTTP/TLS Logging

  • cURL: --verbose or --trace-time --trace <file> to capture timing and frame details.
  • Python/httpx: Enable logging for httpx and h11/h2. For example:
    import logging
    logging.basicConfig(level=logging.DEBUG)
    
  • Node/undici: Set NODE_DEBUG=undici to get transport-level logs.
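
For Python, a slightly more targeted setup than root-level DEBUG is to raise verbosity only on the HTTP client and transport loggers. A small sketch; the logger names assume current httpx/httpcore releases:

import logging

# Verbose connection and request logging without drowning the rest of the app.
logging.basicConfig(level=logging.INFO)
for name in ("httpx", "httpcore"):
    logging.getLogger(name).setLevel(logging.DEBUG)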

Packet Capture and TLS Introspection

  • tcpdump/wireshark: Capture traffic to see if a TCP RST arrives from an intermediate IP rather than the OpenAI host.
  • openssl s_client: Validate the TLS handshake and certificate chain:
    openssl s_client -connect api.openai.com:443 -servername api.openai.com -showcerts
    

Server Timing and Correlation

If you log request IDs and timestamps for your OpenAI calls, correlate the reset timing with application logs and network device logs. Fixed-interval resets (e.g., “always at ~60s”) often point to a specific idle timeout you can adjust.
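
A lightweight way to enable that correlation is to log the response's request identifier alongside your own timestamps. The sketch below assumes the API returns an x-request-id header; check what your responses actually carry:

import time
import httpx

start = time.time()
with httpx.Client(timeout=60.0) as client:
    r = client.get("https://api.openai.com/v1/models",
                   headers={"Authorization": "Bearer YOUR_API_KEY"})
# Correlate this line with proxy and load balancer logs when a reset occurs.
print(f"status={r.status_code} request_id={r.headers.get('x-request-id')} "
      f"elapsed={time.time() - start:.1f}s")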

Production-Grade Resilience Patterns

Retry with Backoff and Idempotency

  • Use exponential backoff with jitter (e.g., 250ms, 500ms, 1s, 2s, 4s, capped).
  • For requests that can safely be retried, include idempotency keys if the API supports them. For streaming, be mindful that partial outputs may have been sent; design your client to resume or tolerate duplicates where possible.

Set Explicit, Sensible Timeouts

  • Connect timeout: short (5–10s) to fail fast on network issues.
  • Read timeout: long enough to accommodate streaming pauses but finite to avoid hung threads (e.g., 60–300s depending on use case).
  • Write timeout: account for upload bandwidth.
  • Pool/keep-alive timeout: align with your proxies’ idle timeouts to avoid stale connections.

Circuit Breakers and Health Checks

  • Trip a circuit if a burst of ECONNRESETs occurs, then probe with lightweight requests to heal.
  • Distinguish between network-level failures and API-level 4xx/5xx in your metrics.
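
A full circuit breaker is usually a library concern, but the core idea fits in a few lines. This is a minimal sketch; the threshold, window, and cool-down values are placeholders, not recommendations:

import time

class ResetCircuit:
    """Open the circuit after a burst of connection resets, then allow a probe."""

    def __init__(self, threshold=5, window=60.0, cooldown=30.0):
        self.threshold, self.window, self.cooldown = threshold, window, cooldown
        self.failures = []     # timestamps of recent resets
        self.opened_at = None  # when the circuit tripped

    def record_reset(self):
        now = time.time()
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.opened_at = now

    def allow_request(self):
        if self.opened_at is None:
            return True
        # After the cool-down, let a probe through; reset state on success elsewhere.
        return time.time() - self.opened_at > self.cooldown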

Concurrency Controls

  • Limit concurrent streams to what your network path can sustain.
  • Batch or queue background jobs to avoid thundering herds that overload pools or proxies.
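
In asyncio code, a semaphore is the simplest way to cap concurrent streams. A minimal sketch; the limit of 4 is an assumption to tune against your network path:

import asyncio

MAX_CONCURRENT_STREAMS = 4
stream_slots = asyncio.Semaphore(MAX_CONCURRENT_STREAMS)

async def run_stream(job):
    # Only MAX_CONCURRENT_STREAMS jobs hold a connection at any moment;
    # the rest wait here instead of piling onto the pool or the proxy.
    async with stream_slots:
        await job()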

Infrastructure Configuration Cheatsheet

Nginx

  • proxy_read_timeout: Extend for streaming (e.g., 300s) if policy allows.
  • keepalive_timeout: Align with client settings; avoid too-short values that drop pooled connections.
  • proxy_buffering off: disable response buffering for SSE/streaming; buffering can break streams or introduce latency.

HAProxy

  • timeout client, timeout server, timeout http-keep-alive: Set explicitly; many sample configs default to 60–120s.
  • option http-keep-alive and option http-buffer-request as appropriate; disable response buffering for SSE-like traffic.

Cloud Load Balancers and Ingress

  • Check idle timeout settings and whether long-lived requests are supported. Increase where permissible.
  • If using Kubernetes ingress, verify annotations for timeouts and buffering behavior (e.g., turning off response buffering for streams).

Making Streaming Robust in Clients

Python (async) Streaming Pattern

Consume tokens as they arrive, push processing to a worker, and keep reading:

import asyncio

async def process_stream(stream):
    tasks = []
    async for chunk in stream:
        # Hand each chunk to a task so the read loop keeps draining the socket
        tasks.append(asyncio.create_task(handle_chunk(chunk)))
    # Wait for outstanding work once the stream is fully consumed
    await asyncio.gather(*tasks)

async def handle_chunk(chunk):
    # CPU-bound or I/O work happens here, off the read path
    pass

Node Streaming Pattern

Use the reader API and respect backpressure:

const reader = response.body.getReader();
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  // Process promptly; avoid long blocking ops here
}

If you must do heavy work, send it to a worker thread or queue and keep the read loop flowing.

Keeping Your TLS and Certificate Chain Healthy

  • Update OpenSSL and runtime regularly to receive fixes for subtle TLS issues.
  • On Linux, ensure ca-certificates is current; in Python, keep certifi updated; in Node, update Node.js for current CAs.
  • Verify SNI is set to the host you’re connecting to; most libraries handle this automatically when you pass the correct hostname.
  • If behind corporate TLS interception, import the corporate root CA properly or, ideally, bypass interception for OpenAI domains.

Distinguishing Client, Network, and Server Causes

Because the connection is reset during SSL_read, before an HTTP status code can be returned, you won’t see a 4xx or 5xx. Use these clues:

  • Only your app fails, curl works: Client configuration (pooling, HTTP/2, streaming handling).
  • Fails on corporate network, works elsewhere: Proxy/VPN/TLS inspection or path MTU.
  • Fails at a consistent second count: Idle/request timeout at a middlebox.
  • Random but rare: Transient network blips; increase retries and resilience.

A Practical Step-by-Step Troubleshooting Flow

  1. Reproduce with curl on the same host. If curl succeeds, move on; if not, try another network to isolate a path issue.
  2. Force HTTP/1.1 and retest. If stable, keep HTTP/1.1 or upgrade runtime to improve HTTP/2 stability.
  3. Disable proxies/VPN and retest. If it fixes it, engage network admins to adjust timeouts and buffering or whitelist destinations.
  4. Upgrade your SDK/runtime/CA bundle. Old stacks cause subtle TLS issues.
  5. Adjust client timeouts and streaming consumption. Keep reading the stream; avoid long pauses.
  6. Reduce connection reuse. Lower keep-alive durations or recreate clients after failures.
  7. Instrument with verbose logs and, if necessary, packet captures to catch a TCP RST and identify the sender.

Real-World Patterns and How Teams Resolved Them

Corporate Proxy Resetting Streams at 60 Seconds

A team streaming token-by-token responses noticed disconnects around 60 seconds. Curl on a home network worked perfectly. TLS logs indicated a TCP RST from an internal proxy IP. Fix: they requested a higher idle timeout and disabled response buffering for SSE on the proxy. As a fallback, they forced HTTP/1.1 for streaming endpoints and kept HTTP/2 for non-streaming requests.

VPN MTU Causing Intermittent Resets

Under a VPN, large responses occasionally reset. Packet captures showed fragmentation and ICMP “fragmentation needed” messages discarded by a firewall. Fix: enabled MSS clamping on the VPN and reduced MTU to 1400 on the client interface. Resets vanished.

Pooling With Stale Connections Through NAT

High-concurrency workers reused keep-alive connections for hours. A NAT device recycled mappings, causing silent half-dead connections that failed only on next read. Fix: set a max connection age (e.g., 60 seconds), lowered keep-alive timeouts, and added a retry-once policy that drops the pool on ECONNRESET. Failures dropped dramatically.

Buffered Ingress Breaking SSE

A Kubernetes ingress buffered responses by default. Streaming endpoints delivered nothing for tens of seconds and then reset. Fix: set ingress annotations to disable response buffering and increase read timeouts; the stream became stable.

Code Patterns That Reduce SSL_ERROR_SYSCALL Incidents

Python httpx: Client With Tuned Timeouts and Retry

import httpx, time, random

def backoff(attempt, base=0.25, cap=8):
    return min(cap, base * (2 ** attempt)) + random.uniform(0, base)

def call_with_retry(url, headers):
    for attempt in range(6):
        try:
            with httpx.Client(http2=False, timeout=httpx.Timeout(connect=10, read=120, write=60, pool=30)) as client:
                return client.get(url, headers=headers)
        except (httpx.ReadError, httpx.ConnectError, httpx.WriteError) as e:
            if attempt == 5:
                raise
            time.sleep(backoff(attempt))

resp = call_with_retry("https://api.openai.com/v1/models", {"Authorization": "Bearer YOUR_API_KEY"})
print(resp.status_code)

Node undici: Retry and Backpressure-Aware Streaming

import { fetch } from "undici";

async function withRetry(url, options, attempts = 5) {
  let lastErr;
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch(url, options);
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return res;
    } catch (e) {
      lastErr = e;
      if (i === attempts - 1) break; // no point sleeping after the final attempt
      await new Promise(r => setTimeout(r, Math.min(8000, 250 * (2 ** i)) + Math.random() * 250));
    }
  }
  throw lastErr;
}

const res = await withRetry("https://api.openai.com/v1/models", {
  headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` }
});

console.log(res.status);

Testing Streaming Without Business Logic

To isolate whether your processing loop is causing stalls, write a “drain-only” test that just reads and discards:

// Node
const res = await fetch("https://your-streaming-endpoint", { /* headers */ });
const reader = res.body.getReader();
while (true) {
  const { done } = await reader.read();
  if (done) break;
  // do nothing
}

If the drain-only test is stable but your real handler isn’t, the issue is likely backpressure or blocking work inside your loop.
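
The same drain-only check in Python, using httpx streaming; replace the URL and headers with whatever your real handler uses:

import httpx

with httpx.Client(timeout=httpx.Timeout(10.0, read=300.0, write=30.0, pool=30.0)) as client:
    # Stream the response and discard every chunk as fast as it arrives.
    with client.stream("GET", "https://your-streaming-endpoint") as r:
        for _ in r.iter_bytes():
            pass  # do nothing; we only care whether the connection stays up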

Security and Compliance Considerations

  • If you must use TLS interception for policy reasons, ensure the interceptor’s CA is installed and updated, and configure it to pass through long-lived streaming responses without buffering.
  • Keep dependencies updated to receive security patches that also address stability and protocol edge cases.
  • Log errors without recording sensitive payloads. Store timing and connection details sufficient for troubleshooting.

Operational Playbook for Teams

  • Runbooks: Document the curl tests, HTTP/1.1 toggle, proxy bypass steps, and how to collect logs quickly.
  • SLOs: Define acceptable transient error rates and ensure your retries and circuit breakers meet them.
  • Dashboards: Track connection resets separately from HTTP errors. Graph by network segment or datacenter if applicable.
  • Chaos Testing: Intentionally inject resets and slowdowns in a staging environment to verify resilience strategies.

When It’s Probably Not Your Fault

Even with perfect client and network configuration, the internet is lossy. Undersea cable events, ISP peering hiccups, and transient congestion can trigger resets. If your telemetry shows isolated, non-correlated resets that succeed on retry, the best strategy is graceful handling:

  • Make operations idempotent or deduplicate results when replayed.
  • Use retry budgets and cap total request time to preserve user experience.
  • Favor streaming only when it benefits UX; otherwise consider non-streaming endpoints that are less sensitive to middlebox behavior.

Checklist You Can Apply Today

  • Force HTTP/1.1 temporarily; test if the error disappears.
  • Upgrade your OpenAI SDK, runtime, OpenSSL, and CA bundles.
  • Tune timeouts: connect (5–10s), read (60–300s), write (30–120s), and pool lifetimes.
  • Consume streams without blocking; implement backpressure-aware patterns.
  • Implement retries with exponential backoff and jitter; drop stale pools between attempts.
  • Disable response buffering on proxies for streaming; raise idle timeouts where allowed.
  • Bypass corporate proxies/VPNs for a control test; if it fixes it, coordinate a network-side change.
  • Capture verbose logs and, if needed, a packet trace to identify a TCP RST sender.

Key Takeaways for Engineering Teams

  • “SSL_read: SSL_ERROR_SYSCALL, errno 104” means the socket was reset mid-read; it’s almost always a transport-path issue.
  • Streaming workloads expose timeout and buffering problems in proxies and clients—keep the data flowing and avoid blocking reads.
  • Protocol switches (HTTP/2 vs HTTP/1.1), connection pooling policies, and timeouts are powerful levers to stabilize traffic.
  • Most incidents are fixable without changing server-side logic: adjust clients, proxies, or network settings and add resilient retries.
