SED: multiple patterns on the same line, how to match/parse first one

Question

I have a file that holds phone number data along with some useless stuff. I'm trying to parse the numbers out, and when there is only one phone number per line, it's no problem. But when a line has multiple numbers, sed matches the last one (even though everywhere it says it should match only the first pattern?), and I can't get the other numbers out.

My data.txt:

bla bla bla NUM:09011111111 bla bla bla bla NUM:08022222222 bla bla bla

When I parse for the data, my idea was first to remove all the "initial" "bla bla bla" in front of the first phone number (so I search for first occurrence of 'NUM:'), then I remove all the stuff after phone number, and get the number. After that I want to parse the next occurrence from the leftover string.

So now when I try to sed it, I always get the last number on the line:

>sed 's/.*NUM://' data.txt
08022222222 bla bla bla
>

Primarily I would like to understand what's wrong with my understanding of sed. Of course, more efficient suggestions are welcome! Doesn't my sed command say: replace everything before 'NUM:' with '' (empty)? Why does it always match the last occurrence?

Thanks!

Sed is greedy. If there is a second NUM:, the first can be consumed by .* . — user unknown, Mar 13 '12 at 9:37
+1 for sample data, implied expected output AND some sample code that's not working. Good luck. — shellter, Mar 13 '12 at 15:37

potong · Accepted Answer · 2012-03-13 10:02:02Z

This might work for you:

echo "bla bla bla NUM:09011111111 bla bla bla bla NUM:08022222222 bla bla bla" |
sed 's/NUM:/\n&/g;s/[^\n]*\n\(NUM:[0-9]*\)[^\n]*/\1 /g;s/.$//'
NUM:09011111111 NUM:08022222222

The problem is that .* is greedy, i.e. it takes the longest possible match, not the shortest one. The regex still starts matching at the leftmost position, but .* then runs on to the last NUM: on the line. By placing a unique character (\n; sed uses it as the line delimiter, so it cannot already exist in the line) in front of each string we're interested in (NUM:...), and then deleting everything that is not that unique character ([^\n]*) followed by the unique character (\n), we effectively split the string into manageable pieces.
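
Since sed always starts a match at the leftmost possible position, another way to get only the first number is to cut the line off right after the first NUM: value and strip the prefix afterwards. This is a minimal sketch, not part of the original answer; first_num is a hypothetical helper name, and it assumes a number is one or more digits directly after NUM:.

```shell
# Sketch: print the first NUM: value of every line that has one.
# first_num is a hypothetical helper name, not from the original post.
first_num() {
  # /NUM:[0-9]/  only touch lines that contain NUM: followed by a digit.
  # 1st s///     matches at the leftmost NUM:<digits>; the trailing .* then
  #              swallows the rest of the line, so no later number survives.
  # 2nd s///     the greedy .* is now harmless: no NUM:<digits> follows it.
  sed -n '/NUM:[0-9]/{s/NUM:\([0-9][0-9]*\).*/NUM:\1/;s/.*NUM://;p;}'
}

line='bla bla bla NUM:09011111111 bla bla bla bla NUM:08022222222 bla bla bla'
printf '%s\n' "$line" | first_num
```

For the sample line this should print 09011111111. It avoids \n in the pattern, so it may also suit sed versions that do not accept \n there.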

I was suspecting this was something to do with greediness indeed. — julumme, Mar 14 '12 at 2:40
Wow that answers my hours-long search for an example of character-based as opposed to line-based sed work. I see we stick the newline character as a marker in the line-based pattern space, then remove parts that end with that marker to counter sed's greedy match. — eel ghEEz, Aug 19 '16 at 23:36
OSX: The '\n' doesn't work for sed. Use 'gsed' (installable with Brew) instead. — ericpeters0n, May 10 '17 at 18:17

Eduardo Ivanec · Answer 2 · 2012-03-13 11:35:20Z

As you know by now, sed regexes are greedy and, as far as I can tell, can't be made non-greedy.

Two alternatives that haven't been brought up yet use other tools for this kind of matching/extraction.

You can use perl as a drop-in replacement for sed with the -pe options. It supports the non-greedy ? modifier:

$ perl -pe 's/.*?NUM://' data.txt
09011111111 bla bla bla bla NUM:08022222222 bla bla bla

You can use the -o option to GNU grep to get only the bits of your data that match the regex:

$ egrep -o 'NUM:[0-9]*' data.txt 
NUM:09011111111
NUM:08022222222
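
If only the digits are wanted, the grep output can be piped through a small sed call that drops the tag. A minimal sketch under that assumption; all_nums is a hypothetical helper name, and + is used instead of * so that a bare NUM: with no digits is skipped.

```shell
# Sketch: list every number in the input, one per line, without the NUM: tag.
# all_nums is a hypothetical helper name, not from the original post.
all_nums() {
  # grep -o prints each match on its own line; -E enables the + quantifier.
  # The sed call then removes the leading NUM: from every match.
  grep -oE 'NUM:[0-9]+' | sed 's/^NUM://'
}

line='bla bla bla NUM:09011111111 bla bla bla bla NUM:08022222222 bla bla bla'
printf '%s\n' "$line" | all_nums
```

Note that grep -o flattens the result: matches from all input lines come out one per line, so the information about which numbers shared a line is lost.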

Thank you for suggesting an alternative, I will definitely look into the possible performance differences between sed and perl — julumme, Mar 14 '12 at 2:38
Thanks for the egrep suggestion. Too bad sed limits itself to pattern spaces taking entire lines. — eel ghEEz, Aug 19 '16 at 23:44

jfg956 · Answer 3 · 2012-03-13 23:01:00Z

If a number is defined as the digits following a NUM: tag:

sed -n -e 's/$/\n/' -e ':begin' \
  -e 's/\(NUM:[0-9][0-9]*\)\(.*\)\n\(.*\)/\2\n\3 \1/' \
  -e 'tbegin' -e 's/.*\n //' -e '/NUM/p'

What this does is:

1. Put a \n at the end of the line to act as a marker.
2. Try to find a number before the marker, and move it to the end of the line (after the marker).
3. If a number was found, go to step 2.
4. When no numbers are left before the marker, remove everything before the numbers.
5. If a number is on the line, print it (this handles the case where no numbers are found).
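
To watch the steps above happen, the loop can be run with a visible marker and a p command at the top of the loop. This is a sketch rather than the original script: trace_nums is a hypothetical name, and "|" replaces \n as the marker on the assumption that the data never contains "|".

```shell
# Sketch: the same loop with "|" as the marker (assumption: "|" never occurs
# in the data) and a p command that prints the pattern space on every pass.
# trace_nums is a hypothetical helper name, not from the original post.
trace_nums() {
  sed -n -e 's/$/|/' -e ':begin' -e 'p' \
    -e 's/\(NUM:[0-9][0-9]*\)\(.*\)|\(.*\)/\2|\3 \1/' \
    -e 'tbegin' -e 's/.*| //' -e '/NUM/p'
}

line='bla bla bla NUM:09011111111 bla bla bla bla NUM:08022222222 bla bla bla'
printf '%s\n' "$line" | trace_nums
```

For the sample line this should print three intermediate lines, each with one more number moved behind the marker, followed by the final result NUM:09011111111 NUM:08022222222.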

It can also be done the other way around, first dropping lines without numbers:

sed -e '/NUM/!d' -e 's/$/\n/' -e ':begin' \
  -e 's/\(NUM:[0-9][0-9]*\)\(.*\)\n\(.*\)/\2\n\3 \1/' \
  -e 'tbegin' -e 's/.*\n //'

I appreciate you taking time to give me an alternative solution, I will study it. However, it seems a bit difficult to understand though, and also there are quite many calls to sed here, I worry that performance is slower than in a "3-call solution" — julumme, Mar 14 '12 at 2:49
There is a single call to sed, only a little more complex script with 6 commands. You are right, potong's solution only have 3 commands, but those commands are executed more than once (the g argument to the s command), so it does not mean that it is faster. I agree that it is a little more elegant for this problem. — jfg956, Mar 14 '12 at 7:17

kev · Answer 4 · 2012-03-13 09:47:41Z

You can use this pattern:

sed -r 's/^(.*NUM:)(.*NUM:.*)$/\2/'
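
A quick way to see what this pattern does with one, two, and three NUM: tags is sketched below. strip_to_first is a hypothetical wrapper name, and -E is used in place of -r on the assumption that the installed sed accepts it (recent GNU and BSD versions do).

```shell
# Sketch: wrap the pattern above and feed it lines with different tag counts.
# strip_to_first is a hypothetical helper name, not from the original post.
strip_to_first() {
  # Group 1 is greedy, but group 2 must still contain a NUM:, so group 1
  # stops at the second-to-last NUM: on the line.
  sed -E 's/^(.*NUM:)(.*NUM:.*)$/\2/'
}

printf '%s\n' 'a NUM:1 b NUM:2 c' | strip_to_first          # two tags
printf '%s\n' 'a NUM:1 b' | strip_to_first                  # one tag
printf '%s\n' 'a NUM:1 b NUM:2 c NUM:3 d' | strip_to_first  # three tags
```

With two tags the prefix up to the first NUM: is removed. With one tag the pattern does not match and the line is left unchanged, and with three tags everything up to the second NUM: is removed, so the pattern seems best suited to lines with exactly two numbers.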
