Getting entities from pre-saved DocBin
Matthew Harrington
I have around 700k documents that I want to process in spaCy and save into a DocBin for later use.
I wrote code to do a keyword search using PhraseMatcher and it worked great. I'm now trying to build a knowledge graph out of the DocBin I have, but I can't access the entities to use them in the graph logic. I read somewhere that DocBins don't keep that information (?), but when I print DocBin.tokens I get some values and not just an empty output.
This might be a very stupid question but I'm quite lost and the documentation does not seem to be detailed enough for this.
import spacy
from spacy.tokens import DocBin
from spacy.vocab import Vocab
nlp = spacy.load('fr_dep_news_trf')
DocBinPath = r'C:\[Redacted]\FRdocBin.nlp'
loadedDocBin = DocBin(Vocab()).from_disk(DocBinPath)
DocList=list(loadedDocBin.get_docs(nlp.vocab))
for doc in DocList:
    People = list(set([ent.text for ent in doc.ents if ent.label_=='PERSON']))
This doesn't produce any errors but doc.ents is empty.
This is the code for saving the Docbin:
FRdoc_bin = DocBin (store_user_data=True,attrs=['ENT_TYPE','LEMMA','LIKE_EMAIL','LIKE_URL','LIKE_NUM','ORTH','POS','HEAD','DEP'])
doc = frNLP(text)
FRdoc_bin.add(doc)
FRdoc_bin.to_disk(CreatedModelPath+r'\FRdocBin'+'.nlp')
2 Answers
If you want to use custom attrs, you need both ENT_IOB and ENT_TYPE for entities.
Are you sure that you need custom attrs in the first place? Have you customized the values for LIKE_URL or other lexical attrs? If not, the default attrs for DocBin should be fine.
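To illustrate the point about ENT_IOB and ENT_TYPE, here is a minimal sketch of a DocBin round trip with custom attrs that include both. The document is built by hand (the words and the PER/LOC labels are made up for the example) so that it runs without downloading a model; in real use the entities would come from a pipeline that has an NER component.

```python
from spacy.tokens import Doc, DocBin, Span
from spacy.vocab import Vocab

# Hand-built doc standing in for pipeline output (illustrative data only).
vocab = Vocab()
doc = Doc(vocab, words=["Marie", "Curie", "vit", "à", "Paris", "."])
doc.ents = [Span(doc, 0, 2, label="PER"), Span(doc, 4, 5, label="LOC")]

# Custom attrs that include BOTH entity columns.
doc_bin = DocBin(attrs=["ORTH", "ENT_IOB", "ENT_TYPE"])
doc_bin.add(doc)

# Round-trip through bytes; to_disk/from_disk would be used for files.
restored_bin = DocBin().from_bytes(doc_bin.to_bytes())
restored = list(restored_bin.get_docs(vocab))[0]

# Collect (text, label) pairs from the restored doc.
entities = [(ent.text, ent.label_) for ent in restored.ents]
print(entities)
```

If you don't pass attrs at all, the defaults already include both entity columns, so calling DocBin() with no attrs argument is the simpler option unless you have a reason to restrict what gets stored.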
Edit: I figured out the issue from the spaCy discussion. Quite simply, the fr model I was using (fr_dep_news_trf) doesn't support NER. I switched to fr_core_news_lg and it worked :)
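For anyone hitting the same thing, a quick way to catch this before processing 700k documents is to look at the pipeline's component names. This is only a sketch: has_ner is a hypothetical helper, it assumes the component is registered under the usual name "ner", and a blank French pipeline is used so the example runs without a model download.

```python
import spacy

def has_ner(nlp):
    """Return True if the pipeline has a component named 'ner' (assumed name)."""
    return "ner" in nlp.pipe_names

# A blank pipeline has only a tokenizer, so there is nothing to fill doc.ents.
blank = spacy.blank("fr")
before = has_ner(blank)

# Add an untrained NER component purely to show the check flipping.
blank.add_pipe("ner")
after = has_ner(blank)

print(before, after)
```

With a real model you would run the same check right after spacy.load(...). Also worth checking is which label names the model actually uses, since the French pipelines I know of label people as PER rather than PERSON.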