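python - Extracting information from a string using regular expressions

This is a follow-up to, and a more complicated version of, this question: Extracting contents of a string within parentheses.

In that question, I had the following string: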

"Will Farrell (Nick Hasley), Rebecca Hall (Samantha)"

I wanted to get a list of tuples in the form (actor, character):

[('Will Farrell', 'Nick Hasley'), ('Rebecca Hall', 'Samantha')]

To generalize the problem, I now have a slightly more complicated string from which I need to extract the same information. My string is:

"Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary), 
with Stephen Root and Laura Dern (Delilah)"

I need it formatted as follows:

[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'),
 ('Stephen Root', ''), ('Laura Dern', 'Delilah')]

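I know I can replace the filler words (with, and, &, etc.), but I can't quite figure out how to add an empty entry ('') when an actor has no character name (Stephen Root, in this case). What is the best way to do this?

Finally, I need to account for an actor having multiple roles, and build one tuple for each role the actor has. My final string is: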

"Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with
 Stephen Root and Laura Dern (Delilah, Stacy)"

I need to build a list of tuples as follows:

[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'),
 ('Glenn Howerton', 'Brad'), ('Stephen Root', ''), ('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]

Thanks.

import re
credits = """Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with
 Stephen Root and Laura Dern (Delilah, Stacy)"""

# split on commas (only if outside of parentheses), "with" or "and"
splitre = re.compile(r"\s*(?:,(?![^()]*\))|\bwith\b|\band\b)\s*")

# match the part before the parentheses (1) and what's inside the parens (2)
# (only if parentheses are present)
matchre = re.compile(r"([^(]*)(?:\(([^)]*)\))?")

# split the parts inside the parentheses on commas
splitparts = re.compile(r"\s*,\s*")

characters = splitre.split(credits)
pairs = []
for character in characters:
    if character:
        match = matchre.match(character)
        if match:
            actor = match.group(1).strip()
            if match.group(2):
                parts = splitparts.split(match.group(2))
                for part in parts:
                    pairs.append((actor, part))
            else:
                pairs.append((actor, ""))

print(pairs)

Output:

[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), 
 ('Glenn Howerton', 'Gary'), ('Glenn Howerton', 'Brad'), ('Stephen Root', ''), 
 ('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]
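
The question also lists "&" as a filler word, which the split pattern in the code does not cover. The following is a minimal sketch of the same approach wrapped in a function, with "&" added to the separators. The name parse_credits and the upper-case constant names are hypothetical, chosen for illustration only. The blank role falls out of the optional parentheses group: when an entry has no "(...)", match.group(2) is None and the tuple gets ''. One limitation to keep in mind: "and", "with" or "&" appearing inside the parentheses would also be treated as a separator.

```python
import re

# Hypothetical helper, not part of the original code: same technique,
# with "&" added to the filler words mentioned in the question.
# A comma only splits when it is NOT followed by a ")" before any other
# parenthesis, i.e. when it sits outside "(...)".
SPLIT_RE = re.compile(r"\s*(?:,(?![^()]*\))|\bwith\b|\band\b|&)\s*")
# Group 1: text before the parentheses; group 2: text inside them (optional).
MATCH_RE = re.compile(r"([^(]*)(?:\(([^)]*)\))?")
# Separator between several roles inside one pair of parentheses.
ROLE_SPLIT_RE = re.compile(r"\s*,\s*")


def parse_credits(credits):
    """Return a list of (actor, role) tuples; role is '' when none is given."""
    pairs = []
    for chunk in SPLIT_RE.split(credits):
        match = MATCH_RE.match(chunk)  # always matches, possibly empty
        actor = match.group(1).strip()
        if not actor:
            continue  # empty piece left between two adjacent separators
        roles = match.group(2)
        if roles:
            for role in ROLE_SPLIT_RE.split(roles.strip()):
                pairs.append((actor, role))
        else:
            pairs.append((actor, ""))
    return pairs


print(parse_credits("Stephen Root & Laura Dern (Delilah, Stacy)"))
# [('Stephen Root', ''), ('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]
```

Precompiling the three patterns at module level keeps the function body short and avoids recompiling them on every call.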