Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


parsing - Processing a repeatedly structured text file with Python

I have a big text file structured in blocks like:

    Student = {
        PInfo = {
            ID   = 0001;
            Name.First = "Joe";
            Name.Last = "Burger";
            DOB  = "01/01/2000";
        };
        School = "West High";
        Address = {
            Str1 = "001 Main St.";
            Zip = 12345;
        };
    };
    Student = {
        PInfo = {
            ID   = 0002;
            Name.First = "John";
            Name.Last = "Smith";
            DOB  = "02/02/2002";
        };
        School = "East High";
        Address = {
            Str1 = "001 40nd St.";
            Zip = 12346;
        };
        Club = "Football";
    };
    ....

Every Student block has the same entries, such as "PInfo", "School" and "Address", but some blocks have additional entries, such as the "Club" entry that "John Smith" has and "Joe Burger" does not. I want to get the name, school name and zip code of each student and store them in a dictionary, like:

    {'Joe Burger': {'School': 'West High', 'Zip': 12345}, 'John Smith': {'School': 'East High', 'Zip': 12346}, ...}

Being new to Python, I tried to open the file and analyze it line by line, but that is cumbersome. The real file is also much larger and more complicated than the example above. Is there an easier way to do this? Thanks in advance.



1 Answer


To parse the file you could define a grammar that describes your input format and use it to generate a parser.

There are many parser generators for Python. For example, you could use Grako, which takes grammars in a variation of EBNF as input and outputs memoizing PEG parsers in Python.

To install Grako, run pip install grako.

Here's a grammar for your format in Grako's flavor of EBNF syntax:

(* a file is zero or more records *)
file = { record }* $;
record = name '=' value ';' ;
name = /[A-Z][a-zA-Z0-9.]*/ ;
value = object | integer | string ;
(* an object contains one or more records *)
object = '{' { record }+ '}' ;
integer = /[0-9]+/ ;
string = '"' /[^"]*/ '"';
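
For comparison, a grammar this small can also be implemented by hand without a parser generator. The following is only a sketch of that alternative, not what Grako generates: the names TOKEN, tokenize, expect, parse_value, parse_records and parse are made up for illustration, and unlike the grammar above it also accepts an empty object.

```python
import re

# One token per match: a name, an integer, a quoted string, or punctuation.
TOKEN = re.compile(r'\s*(?:(?P<name>[A-Z][A-Za-z0-9.]*)|(?P<integer>[0-9]+)'
                   r'|"(?P<string>[^"]*)"|(?P<punct>[={};]))')


def tokenize(text):
    """Yield (kind, value) pairs; raise ValueError on unrecognized input."""
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError('unexpected input at offset %d' % pos)
        pos = match.end()
        yield match.lastgroup, match.group(match.lastgroup)


def expect(tokens, punct):
    """Consume one token and check that it is the given punctuation."""
    token = next(tokens, None)
    if token != ('punct', punct):
        raise ValueError('expected %r, got %r' % (punct, token))


def parse_value(tokens):
    """value = object | integer | string"""
    kind, value = next(tokens, (None, None))
    if kind == 'integer':
        return int(value)
    if kind == 'string':
        return value
    if kind == 'punct' and value == '{':
        # An object becomes a dict of its records.
        return dict(parse_records(tokens, closing='}'))
    raise ValueError('expected a value, got %r' % (value,))


def parse_records(tokens, closing=None):
    """Parse `name = value ;` records until `closing` or the end of input."""
    records = []
    for kind, value in tokens:
        if kind == 'punct' and value == closing:
            return records
        if kind != 'name':
            raise ValueError('expected a name, got %r' % (value,))
        expect(tokens, '=')
        records.append((value, parse_value(tokens)))
        expect(tokens, ';')
    if closing is not None:
        raise ValueError('missing %r' % closing)
    return records


def parse(text):
    """Return the top-level records as a list of (name, value) pairs."""
    return parse_records(tokenize(text))


SAMPLE = '''
Student = {
    PInfo = { ID = 0001; Name.First = "Joe"; Name.Last = "Burger"; };
    School = "West High";
    Address = { Str1 = "001 Main St."; Zip = 12345; };
};
'''
records = parse(SAMPLE)
print(records)
```

The top level is kept as a list of pairs rather than a dict because the name Student repeats there; nested objects are converted to dicts since their entry names are assumed to be unique.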

To generate the parser, save the grammar to a file, e.g., Structured.ebnf, and run:

$ grako -o structured_parser.py Structured.ebnf

This creates the structured_parser module, which can be used to extract the student information from the input:

#!/usr/bin/env python
from pprint import pprint

from structured_parser import StructuredParser

class Semantics(object):
    """Convert the parse results of each grammar rule into Python values."""
    def record(self, ast):
        # record = name '=' value ';' ;
        # value = object | integer | string ;
        return ast[0], ast[2] # name, value
    def object(self, ast):
        # object = '{' { record }+ '}' ;
        return dict(ast[1])
    def integer(self, ast):
        # integer = /[0-9]+/ ;
        return int(ast)
    def string(self, ast):
        # string = '"' /[^"]*/ '"';
        return ast[1]

# read the whole input and parse it, starting from the 'file' rule
with open('input.txt') as f:
    text = f.read()
parser = StructuredParser()
ast = parser.parse(text, rule_name='file', semantics=Semantics())
# keep only the top-level Student records
students = [value for name, value in ast if name == 'Student']
# map "First Last" to the school name and zip code
d = {'{0[Name.First]} {0[Name.Last]}'.format(s['PInfo']):
     dict(School=s['School'], Zip=s['Address']['Zip'])
     for s in students}
pprint(d)

Output

{'Joe Burger': {'School': u'West High', 'Zip': 12345},
 'John Smith': {'School': u'East High', 'Zip': 12346}}
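
Since some Student blocks carry extra entries such as "Club", the final step can be written so that optional entries are copied only when present. The sketch below assumes the students have already been parsed into nested dicts of the shape that the semantic actions above produce; the function name summarize and its optional parameter are made up for illustration.

```python
def summarize(students, optional=('Club',)):
    """Map "First Last" to School and Zip, plus any optional entries found."""
    summary = {}
    for s in students:
        info = s['PInfo']
        name = '{} {}'.format(info['Name.First'], info['Name.Last'])
        entry = {'School': s['School'], 'Zip': s['Address']['Zip']}
        for key in optional:
            if key in s:  # e.g. "Club" exists only for some students
                entry[key] = s[key]
        summary[name] = entry
    return summary


# Hand-written stand-in for the parsed Student blocks from the question.
students = [
    {'PInfo': {'ID': 1, 'Name.First': 'Joe', 'Name.Last': 'Burger',
               'DOB': '01/01/2000'},
     'School': 'West High',
     'Address': {'Str1': '001 Main St.', 'Zip': 12345}},
    {'PInfo': {'ID': 2, 'Name.First': 'John', 'Name.Last': 'Smith',
               'DOB': '02/02/2002'},
     'School': 'East High',
     'Address': {'Str1': '001 40nd St.', 'Zip': 12346},
     'Club': 'Football'},
]
summary = summarize(students)
print(summary)
```

Note that two students with the same full name would overwrite each other in this dictionary; keying on the ID entry instead would avoid that if names are not unique in the real file.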
