Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Differentiate between countries and cities in spaCy NER

I'm trying to extract countries from organisation addresses using spaCy NER. However, it labels countries and cities with the same tag, GPE. Is there any way I can differentiate them?

For instance:

import en_core_web_sm

nlp = en_core_web_sm.load()

doc = nlp('Resilience Engineering Institute, Tempe, AZ, United States; Naval Postgraduate School, Department of Operations Research, Monterey, CA, United States; Arizona State University, School of Sustainable Engineering and the Built Environment, Tempe, AZ, United States; Arizona State University, School for the Future of Innovation in Society, Tempe, AZ, United States')

for ent in doc.ents:
    if ent.label_ == 'GPE':
        print(ent.text)

gives back

Tempe
AZ
United States
United States
Tempe
AZ
United States
Tempe
AZ
United States


1 Answer


As other answers have mentioned, the GPE label in the pre-trained spaCy models covers countries, cities and states alike. There are workarounds, though, and several approaches are possible.

One approach: you could add a custom entity label to the model. There is a good article on Towards Data Science that could help you do that. Gathering training data for this could be a hassle, because you would need to annotate cities and countries in context, at their positions within full sentences, rather than as isolated words. I quote the answer from Stack Overflow:

Spacy NER model training includes the extraction of other "implicit" features, such as POS and surrounding words.

When you attempt to train on single words, it is unable to get generalized enough features to detect those entities.
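To make the "annotate in context" point concrete, here is a minimal sketch of what one training example could look like. It assumes the character-offset layout `(text, {"entities": [(start, end, label)]})` used by spaCy v2-style training data, and the labels `CITY` and `COUNTRY` plus the helper `annotate` are hypothetical names chosen for illustration; check the training docs for your spaCy version before relying on this shape.

```python
def annotate(text, spans):
    """Turn (substring, label) pairs into character-offset annotations.

    Each substring is located at its first occurrence in the text, so this
    helper is only suitable for sentences where the entity appears once.
    """
    entities = []
    for substring, label in spans:
        start = text.find(substring)
        if start == -1:
            raise ValueError(f"{substring!r} not found in text")
        entities.append((start, start + len(substring), label))
    return text, {"entities": entities}


# The entities are tagged inside a full address, not as isolated words,
# so the model can learn from the surrounding context.
example = annotate(
    "Naval Postgraduate School, Monterey, CA, United States",
    [("Monterey", "CITY"), ("United States", "COUNTRY")],
)
print(example)
```

You would need many such examples, covering varied address formats, before the model generalizes.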

An easier workaround to this could be the following:

Install geonamescache

pip install geonamescache

Then use the following code to get the list of countries and cities

import geonamescache

gc = geonamescache.GeonamesCache()

# gets nested dictionary for countries
countries = gc.get_countries()

# gets nested dictionary for cities
cities = gc.get_cities()

The documentation states that you can get a host of other location options as well.

Use the following generator function to collect every value stored under a given key anywhere in a nested dictionary (obtained from this answer):

def gen_dict_extract(var, key):
    if isinstance(var, dict):
        for k, v in var.items():
            if k == key:
                yield v
            if isinstance(v, (dict, list)):
                yield from gen_dict_extract(v, key)
    elif isinstance(var, list):
        for d in var:
            yield from gen_dict_extract(d, key)

Build two lists, one of city names and one of country names:

cities = [*gen_dict_extract(cities, 'name')]
countries = [*gen_dict_extract(countries, 'name')]
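Since each name collection is only used for membership tests, a set may be a better fit than a list: lookups stay fast even with tens of thousands of city names. The sketch below assumes each record is a dictionary with a `name` field one level below the top-level key. The sample data and the helper `names_of` are hypothetical stand-ins, not real geonamescache output.

```python
# Hypothetical sample shaped like the nested dictionaries described above;
# the real data would come from gc.get_countries() and gc.get_cities().
sample_countries = {
    "US": {"name": "United States", "iso": "US"},
    "FR": {"name": "France", "iso": "FR"},
}
sample_cities = {
    "id-1": {"name": "Tempe", "countrycode": "US"},
    "id-2": {"name": "Paris", "countrycode": "FR"},
}


def names_of(records):
    """Collect the 'name' field of every record into a set."""
    return {record["name"] for record in records.values() if "name" in record}


country_names = names_of(sample_countries)
city_names = names_of(sample_cities)

print("United States" in country_names)  # True
print("Tempe" in country_names)          # False
print("Tempe" in city_names)             # True
```

If the dictionaries turn out to be nested more deeply, `set(gen_dict_extract(cities, 'name'))` gives the same kind of container using the recursive helper.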

Then use the following code to differentiate:

import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp('Resilience Engineering Institute, Tempe, AZ, United States; Naval Postgraduate School, Department of Operations Research, Monterey, CA, United States; Arizona State University, School of Sustainable Engineering and the Built Environment, Tempe, AZ, United States; Arizona State University, School for the Future of Innovation in Society, Tempe, AZ, United States')

for ent in doc.ents:
    if ent.label_ == 'GPE':
        if ent.text in countries:
            print(f"Country : {ent.text}")
        elif ent.text in cities:
            print(f"City : {ent.text}")
        else:
            print(f"Other GPE : {ent.text}")

Output:

City : Tempe
Other GPE : AZ
Country : United States
Country : United States
City : Tempe
Other GPE : AZ
Country : United States
City : Tempe
Other GPE : AZ
Country : United States
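In the output above, "AZ" falls through to "Other GPE" because it is neither a country nor a city name. One way to handle that is to add a third lookup for states and wrap the branching in a small function. This is only a sketch: `classify_gpe` and the lookup sets below are hypothetical, and in practice the sets would be built from geonamescache data, lowercased with `casefold()` first.

```python
def classify_gpe(text, countries, cities, states=frozenset()):
    """Return 'Country', 'State', 'City' or 'Other GPE' for an entity string.

    The lookup sets are expected to hold casefolded names, which makes the
    match case-insensitive. Countries are checked first, so a name that is
    both a country and a city is reported as a country.
    """
    key = text.strip().casefold()
    if key in countries:
        return "Country"
    if key in states:
        return "State"
    if key in cities:
        return "City"
    return "Other GPE"


# Hypothetical lookup sets, already casefolded.
countries = {"united states"}
cities = {"tempe", "monterey"}
states = {"az", "ca", "arizona", "california"}

for text in ["Tempe", "AZ", "United States", "Springfield XYZ"]:
    print(f"{classify_gpe(text, countries, cities, states)} : {text}")
# City : Tempe
# State : AZ
# Country : United States
# Other GPE : Springfield XYZ
```

Inside the spaCy loop you would call `classify_gpe(ent.text, ...)` for every entity whose label is GPE.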
