I have a large data file at https://dumps.wikimedia.org/other/pageview_complete/monthly/2024/2024-07/pageviews-202407-user.bz2 (3.3 GB).
I’m trying to stream its contents and process the data line by line, using `requests` with `stream=True` and `iter_content()`.
The logic works fine, and I can process data for about 15-20 minutes before I get:
<SNIP>
urllib3.exceptions.IncompleteRead: IncompleteRead(123551405 bytes read, 3479106984 more expected)
The above exception was the direct cause of the following exception:
<SNIP>
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(123551405 bytes read, 3479106984 more expected)', IncompleteRead(123551405 bytes read, 3479106984 more expected))
During handling of the above exception, another exception occurred:
<SNIP>
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(123551405 bytes read, 3479106984 more expected)', IncompleteRead(123551405 bytes read, 3479106984 more expected))
Here’s my code:
```python
from bz2 import BZ2Decompressor
from contextlib import contextmanager

import requests


@contextmanager
def get_pageview_response():
    url = get_pageview_url()
    with requests.get(url,
                      stream=True,
                      headers={
                          'User-Agent': WP1_USER_AGENT,
                          'Connection': 'keep-alive'
                      },
                      timeout=120) as r:
        r.raise_for_status()
        yield r


def raw_pageviews(decode=False):
    def as_bytes():
        with get_pageview_response() as r:
            decompressor = BZ2Decompressor()
            trailing = b''
            # Read data in 128 MB chunks
            for http_chunk in r.iter_content(chunk_size=128 * 1024 * 1024):
                data = decompressor.decompress(http_chunk)
                lines = [line for line in data.split(b'\n') if line]
                if not lines:
                    continue

                # Reunite incomplete lines
                yield trailing + lines[0]
                yield from lines[1:-1]
                trailing = lines[-1]

            # Nothing left, yield the last line
            yield trailing

    if decode:
        for line in as_bytes():
            yield line.decode('utf-8')
    else:
        yield from as_bytes()


def pageview_components():
    for line in raw_pageviews():
        pass  # Processing logic goes here
```
Based on reading the error message, my current guess is that after 15-20 minutes, my data processing (which includes I/O operations to write to a MariaDB instance) is “falling behind” the HTTP streaming. I imagine that the HTTP socket goes “idle” and the server closes the connection.
Does that sound right? Does it have anything to do with my BZ2 decompression step?
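For what it’s worth, I did sanity-check that incremental BZ2 decompression handles arbitrary chunk boundaries on its own; a minimal standalone test (with made-up sample data, not the real dump):

```python
import bz2
from bz2 import BZ2Decompressor

# Compress some known sample data (a stand-in for the real dump)
original = b'\n'.join(b'line %d' % i for i in range(1000)) + b'\n'
compressed = bz2.compress(original)

# Feed the decompressor in deliberately tiny chunks to force odd boundaries
decompressor = BZ2Decompressor()
result = b''
for i in range(0, len(compressed), 7):
    result += decompressor.decompress(compressed[i:i + 7])

assert result == original  # round-trips regardless of chunk boundaries
```

So the decompressor itself seems fine with partial input, which makes me think the problem is on the network side rather than in this step.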
I’ve tried increasing the HTTP chunk size from 16 MB to 128 MB to give the processing more of a “buffer”, but that doesn’t seem to help.
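One workaround I’m considering (not implemented yet) is resuming with an HTTP Range request when the connection drops: count the compressed bytes consumed so far, and on `ChunkedEncodingError` reopen the request from that offset while keeping the same `BZ2Decompressor` instance. This assumes the server honors `Range` headers, which I haven’t verified; the helper below is a sketch of mine, not part of my actual code:

```python
def resume_headers(bytes_read, user_agent='my-agent/1.0'):
    """Build request headers to resume a download from a byte offset.

    bytes_read is the number of *compressed* bytes already consumed.
    """
    headers = {'User-Agent': user_agent}
    if bytes_read > 0:
        # RFC 9110 range syntax: request everything from bytes_read onward
        headers['Range'] = 'bytes=%d-' % bytes_read
    return headers

print(resume_headers(0))          # no Range header on the first attempt
print(resume_headers(123551405))  # Range: bytes=123551405-
```

Since the decompressor keeps its internal state, feeding it the resumed bytes should pick up exactly where it left off, as long as the resume offset matches the bytes already fed in.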