How to remove particular words from lines of a text file?
Matthew Barrera
My text file looks like this:
Liquid penetration 95% mass (m) = 0.000205348
Liquid penetration 95% mass (m) = 0.000265725
Liquid penetration 95% mass (m) = 0.000322823
Liquid penetration 95% mass (m) = 0.000376445
Liquid penetration 95% mass (m) = 0.000425341

Now I want to delete "Liquid penetration 95% mass (m) = " from my lines to obtain the values only. How should I do it?
8 Answers
If there's only one = sign, you could delete everything before and including = like this:
$ sed -r 's/.* = (.*)/\1/' file
0.000205348
0.000265725
0.000322823
0.000376445
0.000425341

If you want to change the original file, use the -i option after testing:

sed -ri 's/.* = (.*)/\1/' file

Notes

-r           use ERE so we don't have to escape ( and )
s/old/new    replace old with new
.*           any number of any characters
(things)     save things to backreference later with \1, \2, etc.
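To see the capture-group mechanics in isolation, here is a minimal sketch (the sample line and the swapped output format are invented for illustration):

```shell
# Two capture groups: \1 holds the text before " = ", \2 the text after.
# The replacement reorders them, proving each group captured what we expect.
printf 'mass = 0.0002\n' | sed -r 's/(.*) = (.*)/\2 <- \1/'
# prints: 0.0002 <- mass
```

The question's one-group version works the same way, just discarding everything outside the single `(.*)`.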
This is a job for awk; assuming the values occur in the last field only (as per your example):
awk '{print $NF}' file.txt

NF is an awk variable that expands to the number of fields in a record (line); hence $NF (note the $ in front) contains the value of the last field.
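A quick way to see NF and $NF side by side, on one sample line from the question (printing both is my addition, for illustration):

```shell
# NF is the field count; $NF is the content of the last field.
# The sample line splits into 7 whitespace-separated fields.
printf 'Liquid penetration 95%% mass (m) = 0.000205348\n' |
awk '{print NF, $NF}'
# prints: 7 0.000205348
```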
Example:
% cat temp.txt
Liquid penetration 95% mass (m) = 0.000205348
Liquid penetration 95% mass (m) = 0.000265725
Liquid penetration 95% mass (m) = 0.000322823
Liquid penetration 95% mass (m) = 0.000376445
Liquid penetration 95% mass (m) = 0.000425341
% awk '{print $NF}' temp.txt
0.000205348
0.000265725
0.000322823
0.000376445
0.000425341

I decided to compare the different solutions listed here. For this purpose I've created a large file, based on the content provided by the OP:
I created a simple file, named input.file:

$ cat input.file
Liquid penetration 95% mass (m) = 0.000205348
Liquid penetration 95% mass (m) = 0.000265725
Liquid penetration 95% mass (m) = 0.000322823
Liquid penetration 95% mass (m) = 0.000376445
Liquid penetration 95% mass (m) = 0.000425341

Then I executed this loop:

for i in {1..100}; do cat input.file | tee -a input.file; done

The terminal window was blocked, so I executed killall tee from another terminal. Then I examined the content of the file with the commands less input.file and cat input.file. It looked good, except for the last line, so I removed that line and created a backup copy: cp input.file{,.copy} (because of the commands that use the in-place option).

The final line count of input.file is 2 192 473. I got that number with the command wc:

$ cat input.file | wc -l
2192473
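For anyone reproducing this, a bounded doubling loop grows a test file without blocking the terminal the way the self-feeding tee pipeline does (file names here are illustrative, and the doubling count is an assumption you can tune):

```shell
# Grow a benchmark file by doubling it a fixed number of times.
# Each pass concatenates the file with itself, so 1 seed line
# doubled 10 times yields 1024 lines; increase the loop bound as needed.
cd "$(mktemp -d)"
printf 'Liquid penetration 95%% mass (m) = 0.000205348\n' > input.file
for i in $(seq 1 10); do
    cat input.file input.file > input.tmp && mv input.tmp input.file
done
wc -l < input.file
# prints: 1024
```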
Here is the result of the comparison:
grep -o '[^[:space:]]\+$':

$ time grep -o '[^[:space:]]\+$' input.file > output.file

real    0m58.539s
user    0m58.416s
sys     0m0.108s

sed -ri 's/.* = (.*)/\1/' (in place):

$ time sed -ri 's/.* = (.*)/\1/' input.file

real    0m26.936s
user    0m22.836s
sys     0m4.092s

Alternatively, if we redirect the output to a new file, the command is faster:

$ time sed -r 's/.* = (.*)/\1/' input.file > output.file

real    0m19.734s
user    0m19.672s
sys     0m0.056s

gawk '{gsub(".*= ", "");print}':

$ time gawk '{gsub(".*= ", "");print}' input.file > output.file

real    0m5.644s
user    0m5.568s
sys     0m0.072s

rev | cut | rev:

$ time rev input.file | cut -d' ' -f1 | rev > output.file

real    0m3.703s
user    0m2.108s
sys     0m4.916s

grep -oP '.*= \K.*':

$ time grep -oP '.*= \K.*' input.file > output.file

real    0m3.328s
user    0m3.252s
sys     0m0.072s

sed 's/.*= //' (the -i option makes the command a few times slower):

$ time sed 's/.*= //' input.file > output.file

real    0m3.310s
user    0m3.212s
sys     0m0.092s

perl -pe 's/.*= //' (here the -i option doesn't make a big difference in performance):

$ time perl -i.bak -pe 's/.*= //' input.file

real    0m3.187s
user    0m3.128s
sys     0m0.056s

$ time perl -pe 's/.*= //' input.file > output.file

real    0m3.138s
user    0m3.036s
sys     0m0.100s

awk '{print $NF}':

$ time awk '{print $NF}' input.file > output.file

real    0m1.251s
user    0m1.164s
sys     0m0.084s

cut -c 35-:

$ time cut -c 35- input.file > output.file

real    0m0.352s
user    0m0.284s
sys     0m0.064s

cut -d= -f2:

$ time cut -d= -f2 input.file > output.file

real    0m0.328s
user    0m0.260s
sys     0m0.064s
With grep, the -P option interprets the pattern as a Perl-Compatible Regular Expression (PCRE), and -o prints only the matched part. The \K escape discards from the output everything matched before it.
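The effect of \K can be seen by running the same pattern with and without it (the one-line sample input below is invented for the comparison):

```shell
# Without \K, grep -o prints the entire match, prefix included.
printf 'mass (m) = 0.000205348\n' | grep -oP '.*= .*'
# prints: mass (m) = 0.000205348

# With \K, everything matched before it is dropped from the output.
printf 'mass (m) = 0.000205348\n' | grep -oP '.*= \K.*'
# prints: 0.000205348
```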
$ grep -oP '.*= \K.*' infile
0.000205348
0.000265725
0.000322823
0.000376445
0.000425341

Or you could use the cut command instead:

cut -d= -f2 infile

Since the line prefix always has the same length (34 characters) you can use cut:
cut -c 35- < input.txt > output.txt

Reverse the content of the file with rev, pipe the output into cut with space as the delimiter and 1 as the target field, then reverse it again to get the original number:
$ rev your_file | cut -d' ' -f1 | rev
0.000205348
0.000265725
0.000322823
0.000376445
0.000425341

This is simple, short, and easy to write, understand, and check, and I personally like it:
grep -oE '\S+$' file

grep in Ubuntu, when invoked with -E or -P, takes the shorthand \s to mean a whitespace character (in practice usually a space or tab) and \S to mean anything that isn't one. Using the quantifier + and the end-of-line anchor $, the pattern \S+$ matches one or more non-blanks at the end of a line. You can use -P instead of -E; the meaning in this case is the same, but a different regular-expression engine is used, so they may have different performance characteristics.
This is equivalent to Avinash Raj's commented solution (just with an easier, more compact syntax):
grep -o '[^[:space:]]\+$' file

These approaches won't work if there could be trailing whitespace after the number. They can be modified so they do, but I see no point in going into that here. Although it's sometimes instructive to generalize a solution to work under more cases, it's not practical to do so nearly as often as people tend to assume, because one usually has no way to know in which of many different incompatible ways the problem might ultimately need to be generalized.
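For what it's worth, one simple workaround if trailing blanks did appear would be to delete them before matching; this is a sketch, not part of the original answer, and the padded sample line is invented:

```shell
# Trailing blanks would defeat the $-anchored pattern,
# so strip them first, then match the last non-blank run as before.
printf 'mass (m) = 0.000205348   \n' |
sed 's/[[:space:]]*$//' |
grep -oE '\S+$'
# prints: 0.000205348
```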
Performance is sometimes an important consideration. This question doesn't stipulate that the input is very large, and it's likely that every method that has been posted here is fast enough. However, in case speed is desired, here's a small benchmark on a ten million line input file:
$ perl -e 'print((<>) x 2000000)' file > bigfile
$ du -sh bigfile
439M bigfile
$ wc -l bigfile
10000000 bigfile
$ TIMEFORMAT=%R
$ time grep -o '[^[:space:]]\+$' bigfile > bigfile.out
819.565
$ time grep -oE '\S+$' bigfile > bigfile.out
816.910
$ time grep -oP '\S+$' bigfile > bigfile.out
67.465
$ time cut -d= -f2 bigfile > bigfile.out
3.902
$ time grep -o '[^[:space:]]\+$' bigfile > bigfile.out
815.183
$ time grep -oE '\S+$' bigfile > bigfile.out
824.546
$ time grep -oP '\S+$' bigfile > bigfile.out
68.692
$ time cut -d= -f2 bigfile > bigfile.out
4.135

I ran it twice in case the order mattered (as it sometimes does for I/O-heavy tasks) and because I didn't have a machine available that wasn't doing other stuff in the background that could skew the results. From those results I conclude the following, at least provisionally and for input files of the size I used:

Wow! Passing -P (to use PCRE) rather than -G (the default when no dialect is specified) or -E made grep faster by over an order of magnitude. So for large files, it may be better to use this command than the one shown above:

grep -oP '\S+$' file

WOW!! The cut method in αғsнιη's answer, cut -d= -f2 file, is over an order of magnitude quicker than even the faster version of my way! It was the winner in pa4080's benchmark as well, which covered more methods than this one but with smaller input, which is why I chose it, of all the other methods, to include in my test. If performance is important or files are huge, I think αғsнιη's cut method should be used.

This also serves as a reminder that the simple cut and paste utilities shouldn't be forgotten, and should perhaps be preferred when applicable, even though there are more sophisticated tools like grep that are often offered as first-line solutions (and that I am personally more accustomed to using).
perl - substitute the pattern /.*= / with an empty string //:

perl -pe 's/.*= //' input.file > output.file
perl -i.bak -pe 's/.*= //' input.file

From perl --help:

-e program        one line of program (several -e's allowed, omit programfile)
-p                assume loop like -n but print line also, like sed
-i[extension]     edit <> files in place (makes backup if extension supplied)
sed - substitute the pattern with an empty string:

sed 's/.*= //' input.file > output.file

or (but slower than the above):

sed -i.bak 's/.*= //' input.file

I mention this approach because it is a few times faster than those in Zanna's answer.
gawk - substitute the pattern ".*= " with an empty string "":

gawk '{gsub(".*= ", "");print}' input.file > output.file

From man gawk:

gsub(r, s [, t])    For each substring matching the regular expression r in the
                    string t, substitute the string s, and return the number of
                    substitutions. If t is not supplied, use $0...
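The return value the man page mentions (the number of substitutions) can be checked directly; this sketch prints it alongside each modified line, and uses plain awk, whose POSIX gsub behaves the same way as gawk's here:

```shell
# gsub returns how many replacements it made on the current record;
# here each line contains exactly one "...= " prefix, so n is 1 per line.
printf 'a = 1\nb = 2\n' | awk '{n = gsub(/.*= /, ""); print n, $0}'
# prints: 1 1
#         1 2
```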