I would like to use Python to retrieve metadata stored in PDF files. I am trying to use Python xmptools, but find that I cannot extract all the metadata. For example, this paper is available in PDF format. I have the following script that tries to extract the metadata:
from xmptools import XMPMetadata, DC
xmp = XMPMetadata.fromFile("Leonard_2015_Comment_on_‘Dimensionless_units_in_the_SI’.pdf")[0]
print( xmp.getContainerItems(DC.publisher) )
This works fine. The result is [rdflib.term.Literal('IOP Publishing')]. However, if I change the last line to
print( xmp.getContainerItems(DC.identifier) )
then I get None as a result.
I think this may be due to how the XML inside the PDF file is structured. The XML relevant to these two queries is
<dc:publisher>
<rdf:Bag>
<rdf:li>IOP Publishing</rdf:li>
</rdf:Bag>
</dc:publisher>
<dc:identifier>doi:10.1088/0026-1394/52/4/613</dc:identifier>
In the case of publisher, the value is wrapped in an RDF container (rdf:Bag with rdf:li items), but that is not the case for identifier, which holds its value directly as text.
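To make the difference concrete, here is a minimal sketch that reads both forms using only the standard library, not xmptools. It assumes the two properties sit inside an rdf:Description element; the SAMPLE string, the read_property helper and the DC_NS/RDF_NS names are hypothetical and only for illustration, and a real script would first have to pull the XMP packet out of the PDF.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for RDF and Dublin Core
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# The two properties from the question, wrapped in an assumed rdf:Description
SAMPLE = f"""
<rdf:Description xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}">
  <dc:publisher>
    <rdf:Bag>
      <rdf:li>IOP Publishing</rdf:li>
    </rdf:Bag>
  </dc:publisher>
  <dc:identifier>doi:10.1088/0026-1394/52/4/613</dc:identifier>
</rdf:Description>
""".strip()


def read_property(description, namespace, name):
    """Return a property's values as a list, with or without an RDF container."""
    element = description.find(f"{{{namespace}}}{name}")
    if element is None:
        return []
    # Container form: values are rdf:li items inside rdf:Bag, rdf:Seq or rdf:Alt
    items = [li.text for li in element.iter(f"{{{RDF_NS}}}li")]
    if items:
        return items
    # Simple form: the value is the element's own text
    text = (element.text or "").strip()
    return [text] if text else []


description = ET.fromstring(SAMPLE)
print(read_property(description, DC_NS, "publisher"))
print(read_property(description, DC_NS, "identifier"))
```

This falls back to the element's own text when no rdf:li items are found, which is the behaviour I am hoping to get from xmptools itself.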
Is there a way for xmptools to read simple entries where RDF container tags have not been used?