I’m trying to get usable data from CD-DA with pycdio:
import os, io, cdio, pycdio
from pydub import AudioSegment
# Initialize the CD drive
cd = cdio.Device(driver_id=pycdio.DRIVER_LINUX)
cd.open()
track_num = 1
if cd.get_track( track_num ).get_format() == 'audio': # Make sure it's an audio track first
    # Get necessary track information to read
    lsn_start = cd.get_track( track_num ).get_lsn()
    lsn_end = cd.get_track( track_num ).get_last_lsn()
    nblocks = lsn_end - lsn_start + 1
    read_mode = pycdio.READ_MODE_AUDIO
    blocks, data = cd.read_sectors( lsn_start , read_mode, 55 ) # 0-55 only
    #bytes( data ) # Doesn't work unless encoding is supplied
    #io.BytesIO( data ) # Doesn't work
    #data.getvalue() # Doesn't work. Not bytestring, just string
I’m specifically trying to do this without using external applications through subprocess or pipe.
I don’t seem to be able to tell what type of encoding is used in the str object returned.
I’ve tried
print( "Test chardet:", chardet.detect( data )['encoding'] )
but it says this is a ‘str’ object.
If I try to work with it as a string, I get:
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-19: surrogates not allowed
That error, together with the failure of bytes( data ), shows it definitely isn’t UTF-8.
I’ve tried brute-force testing which encodings bytes() will accept, using all 97 encodings listed for Python 3.12 on this page, as well as 16 more Python-specific encodings listed at the same address.
for i in data_encoding:
    try:
        bytes(data, encoding=i)
        print( i, "Good" )
    except (UnicodeError, LookupError):
        pass # This encoding can't handle the string; skip it
I get:
utf_7 Good
punycode Good
raw_unicode_escape Good
unicode_escape Good
To be clear, I wasn’t expecting to get the data returned as type str, and am not sure how to extract usable bytes from it without using a function like bytes(), or if I am using it incorrectly here for what I need.
I would also accept other library recommendations, if that would be better. I’m trying to get the track from disc into memory without writing to the hard drive, and work with it as an AudioSegment.
Took me longer than I’d like to admit to give in and ask the question, but it seems this is the answer:
bytes( data.encode('utf-8', errors='surrogateescape') )
I found that pycdio uses SWIG to wrap libcdio, and according to the SWIG documentation:
“By default, any byte string (char* or std::string) returned from C or C++ code is decoded to text as UTF-8. This decoding uses the surrogateescape error handler under Python 3.1 or higher — this error handler decodes invalid byte sequences to high surrogate characters in the range U+DC80 to U+DCFF.”
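To see what that means in practice, here is a small sketch that needs no CD drive. It is not pycdio code: the `raw` bytes are made up for illustration, and the decode step only imitates what the quote above says SWIG does. It shows why a plain UTF-8 encode fails and why encoding with the same error handler gives the original bytes back.

```python
# Made-up sample standing in for raw PCM bytes (an assumption for illustration;
# real data would come from the drive).
raw = bytes([0x00, 0x7F, 0x80, 0xFF, 0xC3, 0x28])

# Imitate what SWIG is described as doing when it returns a char* to Python 3.
as_str = raw.decode("utf-8", errors="surrogateescape")

# Bytes that are not valid UTF-8 show up as lone surrogates in U+DC80..U+DCFF.
has_surrogates = any(0xDC80 <= ord(ch) <= 0xDCFF for ch in as_str)

# A strict encode refuses those surrogates, like the UnicodeEncodeError above.
try:
    as_str.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Encoding with the same error handler restores the original bytes.
restored = as_str.encode("utf-8", errors="surrogateescape")

print(type(as_str).__name__, has_surrogates, strict_ok, restored == raw)
```

The encodings that reported "Good" in the brute-force test can serialize lone surrogates, but they write them out in their own notation instead of turning them back into the original bytes, so only the surrogateescape round trip is useful here.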
The entire command to get it into an AudioSegment:
Segment = AudioSegment.from_raw( io.BytesIO( bytes( data.encode( 'utf-8', errors='surrogateescape' ) ) ), sample_width=2, frame_rate=44100, channels=2 )
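Since the question notes that one read only covers up to 55 sectors, a whole track would have to be collected in chunks. Below is a rough sketch of such a loop, written under assumptions rather than as a verified pycdio recipe. The names `swig_str_to_bytes`, `read_track_pcm`, `read_chunk` and `fake_read_chunk` are hypothetical, the reader is simulated so the example runs without a drive, and the length check assumes each raw CD-DA sector comes back as exactly 2352 bytes.

```python
CD_AUDIO_SECTOR_BYTES = 2352  # size of one raw CD-DA sector
MAX_BLOCKS_PER_READ = 55      # per-call limit noted in the question


def swig_str_to_bytes(data):
    """Undo the UTF-8/surrogateescape decoding applied to the returned buffer."""
    return data.encode("utf-8", errors="surrogateescape")


def read_track_pcm(read_chunk, lsn_start, nblocks, max_blocks=MAX_BLOCKS_PER_READ):
    """Collect raw PCM for nblocks sectors, starting at lsn_start.

    read_chunk(lsn, count) is a hypothetical callable that returns the str
    handed back by the binding for `count` sectors starting at `lsn`.
    """
    pcm = bytearray()
    lsn = lsn_start
    remaining = nblocks
    while remaining > 0:
        count = min(remaining, max_blocks)
        chunk = swig_str_to_bytes(read_chunk(lsn, count))
        # Assumption: a full read yields count * 2352 bytes once re-encoded.
        if len(chunk) != count * CD_AUDIO_SECTOR_BYTES:
            raise ValueError(f"short read at LSN {lsn}: got {len(chunk)} bytes")
        pcm += chunk
        lsn += count
        remaining -= count
    return bytes(pcm)


calls = []  # records (lsn, count) for each simulated read


def fake_read_chunk(lsn, count):
    """Simulated reader (not pycdio): returns patterned bytes decoded like SWIG would."""
    calls.append((lsn, count))
    raw = bytes((lsn + i) % 256 for i in range(count * CD_AUDIO_SECTOR_BYTES))
    return raw.decode("utf-8", errors="surrogateescape")


pcm = read_track_pcm(fake_read_chunk, lsn_start=0, nblocks=120)
print(len(pcm), calls)
```

With a real drive, `read_chunk` could presumably be something like `lambda lsn, n: cd.read_sectors( lsn, pycdio.READ_MODE_AUDIO, n )[1]`, and the resulting bytes can go through io.BytesIO into AudioSegment.from_raw with the same sample_width, frame_rate and channels as in the command above.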