Diagnosing and Fixing “OpenAI Error: OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 104”

Few errors cause as much confusion as a TLS read failure that bubbles up from OpenSSL with “SSL_ERROR_SYSCALL” and Linux errno 104. When this pops up during an OpenAI API call—often in the middle of a streaming response—it can feel like the server suddenly “hung up” without explanation. Good news: the error is understandable, diagnosable, and usually fixable. This guide unpacks what it means, why it happens, and how to eliminate it with practical steps, language-specific recipes, and real-world examples.

What the Error Actually Means

Let’s decode the message:

  • OpenSSL SSL_read: The TLS library was reading from a secure socket.
  • SSL_ERROR_SYSCALL: OpenSSL encountered a system-level I/O error, not a clean TLS alert.
  • errno 104: On Linux, 104 maps to “Connection reset by peer,” meaning the remote side closed the TCP connection abruptly.

This is not the same as the server sending a normal TLS “close_notify” or an HTTP error code. The connection was torn down mid-flight. Typical causes include network blips, proxies, load balancers timing out idle connections, abrupt server-side terminations under load, or client environment issues like outdated CA bundles or IPv6 path problems.

Why It Appears Frequently With OpenAI

The OpenAI API uses HTTPS and often streams responses for chat completions, audio, and vision. Streaming keeps a connection open for longer than a small JSON response would, exposing it to:

  • NAT or load balancer idle timeouts.
  • Corporate TLS proxies that interrupt connections or reject modern TLS features.
  • Spotty Wi-Fi or cellular networks resetting long-lived TCP connections.
  • HTTP/2 quirks with certain intermediaries.
  • Mismatched TLS settings, cipher support, or CA store issues in constrained environments (containers, older distros).

A Fast Triage Checklist

Before changing code, run through this quick diagnostic list:

  1. Is it intermittent or 100% reproducible?
  2. Does it only happen on certain networks (office, VPN, mobile) or only in containers/servers?
  3. Does it happen only during streaming or also with non-streaming endpoints?
  4. How long into the request does it fail (e.g., 60s, 300s)? Time patterns hint at idle timeouts.
  5. Does it vanish if you try on a different ISP or via a tethered phone?
  6. Does forcing IPv4 or HTTP/1.1 change the outcome?
  7. Do you see any partial chunks or headers before the reset?

Determining Where the Fault Lies

Although the error is surfaced by your client, the root cause can be anywhere along the path—from your code to your kernel, a local proxy, your company firewall, ISP NATs, CDNs, or the OpenAI edge. Narrow it down with a few tests:

  • Run curl with verbose logging and alternate protocols:
    curl -v https://api.openai.com/v1/models \
      -H "Authorization: Bearer YOUR_KEY" \
      --http1.1 --retry 3 --retry-all-errors
  • Try forcing IPv4 if IPv6 paths are flaky:
    curl -v -4 https://api.openai.com/v1/models -H "Authorization: Bearer YOUR_KEY"
  • Test outside your app’s environment: directly on your laptop vs. container vs. CI runner.
  • Inspect TLS with openssl s_client (it won’t do HTTP, but you can spot cert or handshake issues):
    openssl s_client -connect api.openai.com:443 -servername api.openai.com -tls1_2
  • Check DNS resolution consistency (dig/nslookup): if multiple IPs resolve and only one path fails, consider pinning to IPv4 temporarily or improving network routing.

Typical Root Causes

  • Idle timeouts: Cloud load balancers/NATs drop connections after N seconds without traffic.
  • Proxy interference: TLS inspection, HTTP/2 downgrades, or CONNECT misconfigurations.
  • HTTP/2 path issues: Some middleboxes mishandle long-running streams or server-sent events.
  • CA and TLS version mismatches: Outdated ca-certificates or OpenSSL versions in containers.
  • IPv6 blackholes: DNS resolves AAAA, but IPv6 path is broken; the server “vanishes.”
  • Bandwidth or MTU quirks: Large payloads or path MTU issues causing resets.

Fixes and Workarounds That Usually Help

1) Implement robust retries with backoff and idempotency

A reset is often transient. For non-streaming calls, retries with exponential backoff and jitter work well. Include an idempotency key for operations that may create resources or charges to prevent duplicates.
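
As a concrete illustration, here is a minimal Python sketch of exponential backoff with full jitter around a non-streaming call, assuming httpx. The Idempotency-Key header is a placeholder for whatever deduplication mechanism your own service layer provides, not a documented OpenAI API feature; in real code you would also narrow which status codes are worth retrying.

# Sketch: exponential backoff with full jitter; the Idempotency-Key header is
# hypothetical and stands in for your own deduplication mechanism.
import os
import random
import time
import uuid

import httpx

API_KEY = os.environ["OPENAI_API_KEY"]

def post_with_backoff(url, payload, max_tries=5):
    idempotency_key = str(uuid.uuid4())  # reuse the same key across retries of this call
    delay = 0.5
    for attempt in range(1, max_tries + 1):
        try:
            resp = httpx.post(
                url,
                json=payload,
                headers={
                    "Authorization": f"Bearer {API_KEY}",
                    "Idempotency-Key": idempotency_key,  # hypothetical header
                },
                timeout=httpx.Timeout(connect=10, read=120, write=30, pool=10),
            )
            resp.raise_for_status()
            return resp.json()
        except (httpx.TransportError, httpx.HTTPStatusError):
            if attempt == max_tries:
                raise
            time.sleep(random.uniform(0, delay))  # full jitter
            delay = min(delay * 2, 30)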

2) Tune timeouts and keep the connection alive

  • Set sensible read timeouts long enough for your use case.
  • Use TCP keep-alives or HTTP/2 pings to prevent idle timeouts on long-running streams.
  • Consider chunking requests and reducing response durations (e.g., limit max_tokens).
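
For reference, here is a minimal sketch of the socket-level keepalive knobs involved. The TCP_KEEP* constants are Linux-specific, and most HTTP clients expose equivalent settings through their own transport options rather than raw sockets.

# Sketch: enabling TCP keepalive on a raw socket (Linux-specific constants).
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle seconds before first probe
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 15)  # seconds between probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 4)     # failed probes before the kernel gives up
# Keep all of these comfortably below any NAT/LB idle timeout on the path.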

3) Try HTTP/1.1 if HTTP/2 is unstable on your path

Some environments mis-handle HTTP/2. Falling back to HTTP/1.1 for testing can clarify the issue. Permanently pinning to HTTP/1.1 is not ideal but can be an effective workaround.

4) Bypass or correctly configure proxies

  • Set NO_PROXY or equivalent to bypass corporate proxies for api.openai.com if allowed.
  • Disable TLS inspection for api.openai.com or ensure the proxy supports the required TLS features.
  • Ensure the proxy passes through CONNECT tunnels cleanly without altering SNI.
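
For a quick test that takes ambient proxy settings out of the picture, a minimal Python sketch using httpx (trust_env=False tells the client to ignore HTTP_PROXY, HTTPS_PROXY, and NO_PROXY from the environment):

# Sketch: ignore environment proxy settings for a direct connectivity test.
import os

import httpx

direct_client = httpx.Client(trust_env=False, timeout=30.0)
resp = direct_client.get(
    "https://api.openai.com/v1/models",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
)
print(resp.status_code)

# Alternatively, keep the proxy for everything else but exclude OpenAI:
# os.environ["NO_PROXY"] = "api.openai.com"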

5) Update TLS/CA components

  • Update ca-certificates and OpenSSL in your container or OS.
  • Use a modern runtime (Python httpx, Node undici, Go net/http) with current TLS defaults.

6) Prefer IPv4 if IPv6 is unreliable

If you observe failures only on IPv6 paths, force IPv4 temporarily at the client or system resolver level until your network is fixed.
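
With httpx, one way to pin a client to IPv4 is to bind the transport's local address to 0.0.0.0, which restricts outbound connections to IPv4 sockets. A minimal sketch:

# Sketch: restrict httpx to IPv4 by binding the local address to 0.0.0.0.
import httpx

ipv4_transport = httpx.HTTPTransport(local_address="0.0.0.0")
ipv4_client = httpx.Client(transport=ipv4_transport, timeout=30.0)
# ipv4_client.get("https://api.openai.com/v1/models", headers={...})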

7) Shorten long-lived streams

For streaming generations, consider splitting work into smaller chunks or time-bound partial outputs to avoid long idle gaps that trigger middlebox timeouts.

Python: requests/httpx/OpenAI SDK

Modern OpenAI Python SDKs use httpx under the hood. You can customize transport, timeouts, retries, and HTTP/2 settings.

# httpx example with tuned timeouts and retries
import os

import httpx
from backoff import on_exception, expo  # pip install backoff

API_KEY = os.environ["OPENAI_API_KEY"]

timeout = httpx.Timeout(connect=10.0, read=120.0, write=30.0, pool=10.0)
transport = httpx.HTTPTransport(retries=0)  # we'll handle retries with backoff
client = httpx.Client(http2=True, timeout=timeout, transport=transport)  # http2=True needs the httpx[http2] extra

@on_exception(expo, (httpx.ConnectError, httpx.ReadError, httpx.RemoteProtocolError), max_tries=5)
def call_openai():
    resp = client.get("https://api.openai.com/v1/models", headers={"Authorization": f"Bearer {API_KEY}"})
    resp.raise_for_status()
    return resp.json()

# If HTTP/2 issues are suspected:
client_h1 = httpx.Client(http2=False, timeout=timeout)

For the OpenAI Python SDK (v1+), you can bring your own httpx client:

from openai import OpenAI
import httpx

timeout = httpx.Timeout(connect=10, read=120, write=30, pool=10)
client = OpenAI(
    http_client=httpx.Client(http2=True, timeout=timeout)
)

# Streaming example with defensive retries
def stream_chat(messages):
    try:
        with client.chat.completions.stream(
            model="gpt-4o-mini",
            messages=messages,
        ) as stream:
            for event in stream:
                # handle chunks; persist partials so you can recover on reset
                pass
    except Exception as e:
        # retry policy and fallback handling here
        raise

If resets correlate with roughly 300 seconds into a stream, suspect an idle timeout on the path: shorten the stream (reduce max_tokens), split the task into smaller calls, or add keepalive pings so the connection never sits idle that long.

Node.js: fetch/undici/OpenAI SDK

Node’s built-in fetch uses undici. Enable keep-alive, set timeouts, and implement retries. If a proxy is present, configure a tunnel agent or bypass it for api.openai.com.

import { setGlobalDispatcher, Agent } from "undici";

setGlobalDispatcher(new Agent({
  keepAliveTimeout: 30_000,
  keepAliveMaxTimeout: 60_000,
  headersTimeout: 120_000,
  connectTimeout: 10_000
}));

async function callModels() {
  const resp = await fetch("https://api.openai.com/v1/models", {
    headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` }
  });
  if (!resp.ok) throw new Error(`HTTP ${resp.status}`);
  return await resp.json();
}

// Simple backoff
async function withRetry(fn, attempts = 5) {
  let delay = 250;
  for (let i = 0; i < attempts; i++) {
    try { return await fn(); }
    catch (e) {
      if (i === attempts - 1) throw e;
      await new Promise(r => setTimeout(r, delay + Math.random()*delay));
      delay *= 2;
    }
  }
}

If you suspect HTTP/2 middlebox issues and you’re using a library that negotiates h2, force HTTP/1.1 where possible or test via curl to confirm.

cURL and CLI Diagnostics

Use cURL to validate connectivity, change protocols, and enable retries:

# Verbose, with retries and HTTP/1.1
curl -v --http1.1 --retry 5 --retry-all-errors --max-time 120 \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  https://api.openai.com/v1/models

# Force IPv4
curl -v -4 -H "Authorization: Bearer $OPENAI_API_KEY" https://api.openai.com/v1/models

# TCP keepalive
curl -v --keepalive-time 30 -H "Authorization: Bearer $OPENAI_API_KEY" https://api.openai.com/v1/models

Java: OkHttp

OkHttp’s defaults are robust, but you can add pings and tune timeouts:

OkHttpClient client = new OkHttpClient.Builder()
    .retryOnConnectionFailure(true)
    .pingInterval(15, TimeUnit.SECONDS)      // helpful for long-lived streams
    .callTimeout(2, TimeUnit.MINUTES)
    .connectTimeout(10, TimeUnit.SECONDS)
    .readTimeout(2, TimeUnit.MINUTES)
    .build();

// If HTTP/2 is problematic:
// client = client.newBuilder().protocols(Arrays.asList(Protocol.HTTP_1_1)).build();

Go: net/http and HTTP/2 tuning

Go’s http.Client can be tuned to survive transient resets and keep connections healthy. For HTTP/2, configure read idle timeouts and pings via the golang.org/x/net/http2 package:

// Imports assumed at the top of the file:
//   "net/http", "time", and "golang.org/x/net/http2"
transport := &http.Transport{
  TLSHandshakeTimeout: 10 * time.Second,
  IdleConnTimeout:     90 * time.Second,
  MaxIdleConns:        100,
  ForceAttemptHTTP2:   true,
}
client := &http.Client{Transport: transport, Timeout: 2 * time.Minute}

// HTTP/2-specific keepalives: send a ping after 30s of read inactivity and
// treat a missing reply within 10s as a dead connection.
http2Transport, err := http2.ConfigureTransports(transport)
if err == nil {
  http2Transport.ReadIdleTimeout = 30 * time.Second
  http2Transport.PingTimeout = 10 * time.Second
}

If your path has issues with HTTP/2, disable it on the transport to test HTTP/1.1 behavior (for example, set TLSNextProto to an empty, non-nil map); note that ForceAttemptHTTP2 alone only matters when custom TLS or dial settings are in play.

Kubernetes and Cloud Load Balancer Caveats

Many cloud load balancers enforce idle timeouts (for example, AWS ALB defaults to 60 seconds of idle time and NLB to roughly 350 seconds). If a stream has lulls longer than the idle timeout, the L4 or L7 device may reset the connection. Mitigations:

  • Use shorter-lived calls to the API and stitch results.
  • Enable TCP keep-alives on clients and set keepalive intervals below the idle timeout.
  • If you control the load balancer, raise idle timeouts or enable HTTP/2 keepalive pings.
  • Avoid chaining multiple proxies; each hop introduces its own timeout.

Corporate Proxy and TLS Inspection

Enterprises often deploy TLS inspection. Some middleboxes disrupt newer TLS ciphers, SNI, or HTTP/2 streams, causing resets. Options:

  • Request an exception rule for api.openai.com to bypass inspection.
  • Use a direct egress path or NO_PROXY configuration.
  • If a proxy is mandatory, ensure it supports CONNECT tunneling without altering TLS parameters.
  • Check for a company root CA requirement; if so, add it to your trusted store (for inspected traffic only).

CA Store, OpenSSL, and Container Hygiene

Minimal containers sometimes lack updated CA bundles or ship with older OpenSSL. Symptoms include handshake errors or intermittent resets during renegotiation or cert checks. Fixes:

  • Install and regularly update ca-certificates.
  • Use recent base images (e.g., Debian/Ubuntu LTS, Wolfi, or Alpine with current OpenSSL and curl).
  • Pin to a stable runtime version; avoid old Python/Node/Go that default to deprecated TLS settings.

IPv6 Versus IPv4

Public DNS may return both A and AAAA records. Some networks advertise IPv6 but do not route it correctly. If failures disappear with -4 on curl, adjust your app or OS to prefer IPv4 until IPv6 is fixed, or implement Happy Eyeballs–style dialing with timeouts that favor the working path.
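
If you want to detect a broken IPv6 path programmatically, a rough Python sketch is to probe both address families with short timeouts and prefer whichever answers. Real Happy Eyeballs implementations race connections in parallel and are more involved; this only captures the idea.

# Sketch: probe IPv6 and IPv4 paths with short timeouts and report which works.
import socket

def usable_families(host="api.openai.com", port=443, timeout=2.0):
    working = []
    for family in (socket.AF_INET6, socket.AF_INET):
        try:
            infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
        except socket.gaierror:
            continue  # no AAAA/A record for this family
        for *_, sockaddr in infos[:1]:
            try:
                with socket.create_connection(sockaddr[:2], timeout=timeout):
                    working.append(family)
            except OSError:
                pass  # resolvable but unreachable: a likely blackhole
    return working  # e.g. [AF_INET] alone suggests IPv6 resolves but does not route

print(usable_families())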

Large Payloads and Streaming Strategies

Sending huge prompts or uploading large files over shaky links increases reset risk. Practical tips:

  • Compress uploads if supported; otherwise split content into smaller chunks.
  • Reduce max_tokens or stream options to produce earlier tokens, minimizing idle time.
  • Persist partial results as they arrive; if a reset occurs, you can resume by reconstructing context.

Observability: What to Log

When a reset happens, capture:

  • Timestamp and endpoint used.
  • Whether streaming was enabled and how far into the stream it failed.
  • Response headers received so far, especially any Request-Id.
  • Network path specifics: IP version, proxy in use, region/availability zone, container vs. host.
  • Retry attempts and backoff timing.

Enable debug logging in your HTTP client. In Python, raise the httpx and httpcore loggers to DEBUG via the standard logging module; in Node, undici publishes request and connection events through diagnostics_channel; with curl, use -v or --trace-time.
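
For example, in Python you can surface httpx's connection and read events through the standard logging module (a minimal sketch):

# Sketch: surface httpx/httpcore debug logs via the standard logging module.
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
logging.getLogger("httpx").setLevel(logging.DEBUG)
logging.getLogger("httpcore").setLevel(logging.DEBUG)
# Subsequent httpx requests will log connection setup, TLS, and read activity.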

Differentiating Server Overload From Network Breakage

Under load, servers might shed connections. If the reset correlates with peak usage and disappears on retry, treat it like a transient server-side failure: implement exponential backoff with jitter and idempotency keys. If the reset correlates with precise durations (e.g., exactly 300 seconds), suspect an idle timeout along the path. If it only occurs on one network, it’s likely a proxy or routing issue.

Real-World Example 1: Kubernetes Behind an Aggressive Load Balancer

Symptoms: Streaming chat responses fail around the five-minute mark with SSL_ERROR_SYSCALL, errno 104. Root cause: an egress NAT or load balancer idled out connections at 300 seconds without traffic. Fixes implemented:

  • Reduced max_tokens to keep streams short.
  • Implemented client keepalive and HTTP/2 pings every 15 seconds.
  • Added retries for non-streaming steps and stored partial streaming chunks.

Result: No more resets, and user-perceived latency improved due to shorter streaming windows.

Real-World Example 2: Corporate TLS Inspection

Symptoms: Consistent failures only on the office network; working fine on mobile tethering. The proxy performed TLS interception but didn’t support certain TLS extensions used during long-lived HTTP/2 streams. Resolution:

  • Network team exempted api.openai.com from inspection.
  • Client temporarily forced HTTP/1.1 and IPv4 until the exemption went live.

Result: Stable streaming and batch requests.

Real-World Example 3: IPv6 Blackhole

Symptoms: Random SSL_ERROR_SYSCALL errors on a café Wi-Fi. curl -4 worked reliably; curl without -4 failed intermittently. Root cause: Access point advertised IPv6 but didn’t route consistently. Workaround: Force IPv4 for production traffic and add health checks that detect broken IPv6 to auto-fallback.

A Step-by-Step Diagnostic Playbook

  1. Validate basic connectivity:
    dig api.openai.com +short
    nslookup api.openai.com
  2. Test with curl verbose and retries:
    curl -v --retry 5 --retry-all-errors \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      https://api.openai.com/v1/models
  3. Force IPv4 and HTTP/1.1:
    curl -v -4 --http1.1 -H "Authorization: Bearer $OPENAI_API_KEY" https://api.openai.com/v1/models
  4. Try a different network (tethered phone) to isolate corporate/VPN issues.
  5. Check CA store and OpenSSL versions in your runtime/container; update if old.
  6. Disable or bypass proxies; set NO_PROXY for api.openai.com if allowed.
  7. Instrument your client with extended timeouts, keepalive, and retries; verify if the error frequency drops.
  8. For persistent streaming failures at fixed intervals, adjust max_tokens, add keepalive pings, or split the stream.

Resilience Patterns for Production

  • Exponential backoff with jitter and a max retry cap; log a unique correlation ID per attempt.
  • Idempotency keys for requests that might be retried to avoid duplicate charges or work.
  • Circuit breakers: if a downstream path is unstable, fail fast and fallback to a cached or simplified response.
  • Timeout budgets: coordinate timeout values across layers to avoid “retry storms.”
  • Partial result persistence for streaming so you can restart from a known state.
  • Connection pools: avoid reconnecting for every request; healthy pools reduce handshake exposure.
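
As an illustration of the circuit-breaker pattern above, a deliberately minimal Python sketch; production code would add half-open probing, metrics, and thread safety.

# Minimal circuit-breaker sketch: fail fast after repeated transport errors.
import time

import httpx

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_after=60.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # cool-down elapsed; allow a trial request
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except httpx.TransportError:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result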

Language-Specific Tips at a Glance

  • Python:
    • httpx: configure Timeout, toggle http2=True/False on the Client to isolate protocol issues, add backoff, and inject custom transports.
    • requests: mount an HTTPAdapter with urllib3 Retry (see the sketch after this list); set connect and read timeouts; consider switching to httpx for better HTTP/2 support.
  • Node:
    • undici Agent keepAlive; tune timeouts; implement retry wrappers; ensure proxy tunnel configuration is correct.
    • For streams, handle error events and close signals cleanly; persist progress.
  • Java:
    • OkHttp retryOnConnectionFailure, pingInterval, callTimeout; explicitly set HTTP/1.1 if needed.
  • Go:
    • http.Transport tuning; http2.ConfigureTransports with ReadIdleTimeout and PingTimeout; test ForceAttemptHTTP2 false.
  • cURL:
    • --retry-all-errors, --http1.1, -4, --keepalive-time, and verbose mode (-v) for insights.
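
For the requests route mentioned above, a minimal sketch of mounting an HTTPAdapter with urllib3's Retry; connection resets surface as requests.exceptions.ConnectionError, which the connect/read retry counters cover. Only include POST in allowed_methods when the call is safe to repeat.

# Sketch: requests session with urllib3 Retry and explicit timeouts.
import os

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=5,
    connect=3,
    read=3,
    backoff_factor=0.5,                          # 0.5s, 1s, 2s, ... between attempts
    status_forcelist=(429, 500, 502, 503, 504),
    allowed_methods=frozenset({"GET", "POST"}),  # POST only if the call is idempotent for you
)
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))

resp = session.get(
    "https://api.openai.com/v1/models",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    timeout=(10, 120),                           # (connect, read) seconds
)
resp.raise_for_status()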

Security and Compliance Considerations

When dealing with corporate proxies and TLS interception, ensure you understand organizational policy before bypassing inspection. If you must install a corporate root CA, restrict its trust to the environment where it’s required. Never disable certificate verification entirely in production. If you’re capturing packet traces for debugging, avoid including sensitive payloads or redact them before sharing.

Working Around Long-Running Generation Tasks

Long tasks increase the probability of a mid-stream reset. Engineering approaches that help:

  • Break the prompt into stages; commit intermediate outputs and move forward iteratively.
  • Use shorter generation windows with state carried forward in your application to reduce stream duration.
  • If you need continuous output, send periodic heartbeats (or use transports supporting keepalive pings) to prevent idle drops.

Testing Without Your App

To eliminate app-specific issues:

  • Use curl to hit the same endpoints with comparable headers and payload sizes.
  • Test on a different OS or device to rule out local firewall/AV interference.
  • Spin up a simple script using a different language/client; if the issue persists, it’s likely environmental.

Handling Partial Streams Gracefully

When a streaming connection resets, consider what partial content you have already emitted to the user. To improve UX:

  • Show partial results with a transient warning and an automatic “reconnect” attempt that continues the conversation context.
  • Persist each chunk to a buffer so you can reconcile after a retry.
  • If the model supports it, constrain continuation with a brief recap of the last tokens to avoid duplication.
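
One way to put the buffering and recap ideas into practice, sketched in Python against the OpenAI SDK's stream=True interface; the recap construction is an application-level convention, not an API feature, and real code would deduplicate overlapping text.

# Sketch: buffer streamed text so a reset can be recovered with a short recap.
def stream_with_recovery(client, messages, model="gpt-4o-mini", max_attempts=2):
    collected = []
    for attempt in range(max_attempts):
        try:
            stream = client.chat.completions.create(
                model=model, messages=messages, stream=True
            )
            for chunk in stream:
                delta = chunk.choices[0].delta.content or ""
                collected.append(delta)          # persist every chunk as it arrives
            return "".join(collected)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            recap = "".join(collected)[-500:]    # last ~500 chars already produced
            messages = messages + [
                {"role": "assistant", "content": recap},
                {"role": "user", "content": "Continue exactly where you left off."},
            ]
    return "".join(collected)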

Networking Knobs That Matter

  • TCP keepalive intervals: lower than any NAT/LB idle timeouts in your path.
  • DNS caching: avoid long-lived stale caches that point to a degraded edge; respect short TTLs.
  • MTU: mismatches can cause fragmentation-related pain; if you suspect this, test with smaller packets or PMTUD settings.

Identifying Proxy Use Accurately

Many runtime environments silently pick up proxy settings (HTTP_PROXY, HTTPS_PROXY, NO_PROXY). Confirm at runtime what your client sees and whether CONNECT tunnels are formed. If a proxy is unavoidable, test shorter keepalive lifetimes and ensure it supports long-lived streams.
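
To confirm what your Python runtime actually picks up, urllib's getproxies() reflects the same environment variables most clients honor. A small sketch:

# Sketch: print the proxy settings this process would inherit from the environment.
import os
from urllib.request import getproxies

print("Proxy settings seen by this process:", getproxies())
for var in ("HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY",
            "http_proxy", "https_proxy", "no_proxy"):
    if var in os.environ:
        print(f"{var}={os.environ[var]}")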

What to Include When Escalating

If you need to escalate through your network team or to vendor support, gather:

  • Timestamps with timezone and correlation IDs (if present in headers).
  • Whether the issue is reproducible, and steps to reproduce.
  • curl -v output with redacted tokens.
  • Client version and OS/container image details, OpenSSL/httpx/undici versions.
  • Whether HTTP/1.1 or IPv4 mitigates the issue.
  • Network topology notes: proxy in path, VPN, VPC egress, load balancer types, idle timeouts.

A Practical End-to-End Checklist

  1. Update runtime and CA bundles in the environment running your client.
  2. Implement exponential backoff retries with jitter and idempotency where applicable.
  3. Set explicit timeouts suitable for your workloads; add keepalive signals for long streams.
  4. Test HTTP/1.1 and IPv4 to isolate protocol/routing issues.
  5. Bypass or correctly configure proxies; disable TLS inspection for api.openai.com if allowed.
  6. Shorten long-running streams; limit max_tokens and send work in smaller batches.
  7. Log request IDs, timing, partial outputs, and environment details to aid debugging.
  8. If using Kubernetes or cloud egress, align idle timeouts and keepalive intervals end-to-end.

Final Notes on Stability Versus Performance

Reliability often involves trade-offs. For the most stable path, prioritize robust retries, conservative timeouts, shorter streams, and explicit keepalives. Once stable, iteratively reintroduce optimizations: enable HTTP/2 for efficiency, adjust keepalive intervals to reduce chatter, and batch requests where it doesn’t impact user experience. By combining environment fixes (proxies, IPv6, CA updates) with application-level resilience (retries, idempotency, partial persistence), the OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 104 error should move from frequent frustration to an occasional, well-handled event.
