How to remove particular words from lines of a text file?
Matthew Barrera
My text file looks like this:
Liquid penetration 95% mass (m) = 0.000205348
Liquid penetration 95% mass (m) = 0.000265725
Liquid penetration 95% mass (m) = 0.000322823
Liquid penetration 95% mass (m) = 0.000376445
Liquid penetration 95% mass (m) = 0.000425341

Now I want to delete "Liquid penetration 95% mass (m) = " from my lines to obtain the values only. How should I do it?
8 Answers
If there's only one = sign, you could delete everything before and including = like this:
$ sed -r 's/.* = (.*)/\1/' file
0.000205348
0.000265725
0.000322823
0.000376445
0.000425341

If you want to change the original file, use the -i option after testing:

sed -ri 's/.* = (.*)/\1/' file

Notes

-r           use ERE so we don't have to escape ( and )
s/old/new    replace old with new
.*           any number of any characters
(things)     save things to backreference later with \1, \2, etc.
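To see the capture-group mechanics in isolation, here is a minimal sketch (the sample line and the swapped output format are invented for illustration):

```shell
# Two capture groups: \1 holds the text before " = ", \2 the text after.
# The replacement reorders them, proving each group captured what we expect.
printf 'mass = 0.0002\n' | sed -r 's/(.*) = (.*)/\2 <- \1/'
# prints: 0.0002 <- mass
```

The question's one-group version works the same way, just discarding everything outside the single `(.*)`.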
This is a job for awk; assuming the values occur in the last field only (as per your example):
awk '{print $NF}' file.txt

NF is an awk variable that expands to the number of fields in a record (line); hence $NF (note the $ in front) contains the value of the last field.
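A quick way to see NF and $NF side by side, on one sample line from the question (printing both is my addition, for illustration):

```shell
# NF is the field count; $NF is the content of the last field.
# The sample line splits into 7 whitespace-separated fields.
printf 'Liquid penetration 95%% mass (m) = 0.000205348\n' |
awk '{print NF, $NF}'
# prints: 7 0.000205348
```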
Example:
% cat temp.txt
Liquid penetration 95% mass (m) = 0.000205348
Liquid penetration 95% mass (m) = 0.000265725
Liquid penetration 95% mass (m) = 0.000322823
Liquid penetration 95% mass (m) = 0.000376445
Liquid penetration 95% mass (m) = 0.000425341
% awk '{print $NF}' temp.txt
0.000205348
0.000265725
0.000322823
0.000376445
0.000425341

I decided to compare the different solutions listed here. For this purpose I've created a large file, based on the content provided by the OP:
I created a simple file, named input.file:

$ cat input.file
Liquid penetration 95% mass (m) = 0.000205348
Liquid penetration 95% mass (m) = 0.000265725
Liquid penetration 95% mass (m) = 0.000322823
Liquid penetration 95% mass (m) = 0.000376445
Liquid penetration 95% mass (m) = 0.000425341

Then I executed this loop:

for i in {1..100}; do cat input.file | tee -a input.file; done

The terminal window was blocked, so I executed killall tee from another terminal. Then I examined the content of the file with the commands less input.file and cat input.file. It looked good, except for the last line, so I removed that line and created a backup copy: cp input.file{,.copy} (because of the commands that use the in-place option).

The final line count of input.file is 2 192 473. I got that number with the command wc:

$ cat input.file | wc -l
2192473
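For anyone reproducing this, a bounded doubling loop grows a test file without blocking the terminal the way the self-feeding tee pipeline does (file names here are illustrative, and the doubling count is an assumption you can tune):

```shell
# Grow a benchmark file by doubling it a fixed number of times.
# Each pass concatenates the file with itself, so 1 seed line
# doubled 10 times yields 1024 lines; increase the loop bound as needed.
cd "$(mktemp -d)"
printf 'Liquid penetration 95%% mass (m) = 0.000205348\n' > input.file
for i in $(seq 1 10); do
    cat input.file input.file > input.tmp && mv input.tmp input.file
done
wc -l < input.file
# prints: 1024
```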
Here is the result of the comparison:
grep -o '[^[:space:]]\+$':

$ time grep -o '[^[:space:]]\+$' input.file > output.file

real    0m58.539s
user    0m58.416s
sys     0m0.108s

sed -ri 's/.* = (.*)/\1/' (in place):

$ time sed -ri 's/.* = (.*)/\1/' input.file

real    0m26.936s
user    0m22.836s
sys     0m4.092s

Alternatively, if we redirect the output to a new file, the command is faster:

$ time sed -r 's/.* = (.*)/\1/' input.file > output.file

real    0m19.734s
user    0m19.672s
sys     0m0.056s

gawk '{gsub(".*= ", "");print}':

$ time gawk '{gsub(".*= ", "");print}' input.file > output.file

real    0m5.644s
user    0m5.568s
sys     0m0.072s

rev | cut | rev:

$ time rev input.file | cut -d' ' -f1 | rev > output.file

real    0m3.703s
user    0m2.108s
sys     0m4.916s

grep -oP '.*= \K.*':

$ time grep -oP '.*= \K.*' input.file > output.file

real    0m3.328s
user    0m3.252s
sys     0m0.072s

sed 's/.*= //' (the -i option makes the command a few times slower):

$ time sed 's/.*= //' input.file > output.file

real    0m3.310s
user    0m3.212s
sys     0m0.092s

perl -pe 's/.*= //' (here the -i option doesn't make a big difference in performance):

$ time perl -i.bak -pe 's/.*= //' input.file

real    0m3.187s
user    0m3.128s
sys     0m0.056s

$ time perl -pe 's/.*= //' input.file > output.file

real    0m3.138s
user    0m3.036s
sys     0m0.100s

awk '{print $NF}':

$ time awk '{print $NF}' input.file > output.file

real    0m1.251s
user    0m1.164s
sys     0m0.084s

cut -c 35-:

$ time cut -c 35- input.file > output.file

real    0m0.352s
user    0m0.284s
sys     0m0.064s

cut -d= -f2:

$ time cut -d= -f2 input.file > output.file

real    0m0.328s
user    0m0.260s
sys     0m0.064s
With grep, the -P option interprets the pattern as a Perl-Compatible Regular Expression (PCRE), and -o prints only the matched part. The \K escape discards from the output everything matched before it.
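The effect of \K can be seen by running the same pattern with and without it (the one-line sample input below is invented for the comparison):

```shell
# Without \K, grep -o prints the entire match, prefix included.
printf 'mass (m) = 0.000205348\n' | grep -oP '.*= .*'
# prints: mass (m) = 0.000205348

# With \K, everything matched before it is dropped from the output.
printf 'mass (m) = 0.000205348\n' | grep -oP '.*= \K.*'
# prints: 0.000205348
```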
$ grep -oP '.*= \K.*' infile
0.000205348
0.000265725
0.000322823
0.000376445
0.000425341

Or you could use the cut command instead:

cut -d= -f2 infile

Since the line prefix always has the same length (34 characters) you can use cut:
cut -c 35- < input.txt > output.txt

Reverse the content of the file with rev, pipe the output into cut with space as the delimiter and 1 as the target field, then reverse it again to get the original number:
$ rev your_file | cut -d' ' -f1 | rev
0.000205348
0.000265725
0.000322823
0.000376445
0.000425341

This is simple, short, and easy to write, understand, and check, and I personally like it:
grep -oE '\S+$' file

grep in Ubuntu, when invoked with -E or -P, takes the shorthand \s to mean a whitespace character (in practice usually a space or tab) and \S to mean anything that isn't one. Using the quantifier + and the end-of-line anchor $, the pattern \S+$ matches one or more non-blanks at the end of a line. You can use -P instead of -E; the meaning in this case is the same, but a different regular-expression engine is used, so they may have different performance characteristics.
This is equivalent to Avinash Raj's commented solution (just with an easier, more compact syntax):
grep -o '[^[:space:]]\+$' file

These approaches won't work if there could be trailing whitespace after the number. They can be modified so they do, but I see no point in going into that here. Although it's sometimes instructive to generalize a solution to work under more cases, it's not practical to do so nearly as often as people tend to assume, because one usually has no way to know in which of many different incompatible ways the problem might ultimately need to be generalized.
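For what it's worth, one simple workaround if trailing blanks did appear would be to delete them before matching; this is a sketch, not part of the original answer, and the padded sample line is invented:

```shell
# Trailing blanks would defeat the $-anchored pattern,
# so strip them first, then match the last non-blank run as before.
printf 'mass (m) = 0.000205348   \n' |
sed 's/[[:space:]]*$//' |
grep -oE '\S+$'
# prints: 0.000205348
```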
Performance is sometimes an important consideration. This question doesn't stipulate that the input is very large, and it's likely that every method that has been posted here is fast enough. However, in case speed is desired, here's a small benchmark on a ten million line input file:
$ perl -e 'print((<>) x 2000000)' file > bigfile
$ du -sh bigfile
439M bigfile
$ wc -l bigfile
10000000 bigfile
$ TIMEFORMAT=%R
$ time grep -o '[^[:space:]]\+$' bigfile > bigfile.out
819.565
$ time grep -oE '\S+$' bigfile > bigfile.out
816.910
$ time grep -oP '\S+$' bigfile > bigfile.out
67.465
$ time cut -d= -f2 bigfile > bigfile.out
3.902
$ time grep -o '[^[:space:]]\+$' bigfile > bigfile.out
815.183
$ time grep -oE '\S+$' bigfile > bigfile.out
824.546
$ time grep -oP '\S+$' bigfile > bigfile.out
68.692
$ time cut -d= -f2 bigfile > bigfile.out
4.135

I ran it twice in case the order mattered (as it sometimes does for I/O-heavy tasks) and because I didn't have a machine available that wasn't doing other stuff in the background that could skew the results. From those results I conclude the following, at least provisionally and for input files of the size I used:

Wow! Passing -P (to use PCRE) rather than -G (the default when no dialect is specified) or -E made grep faster by over an order of magnitude. So for large files, it may be better to use this command than the one shown above:

grep -oP '\S+$' file

WOW!! The cut method in αғsнιη's answer, cut -d= -f2 file, is over an order of magnitude quicker than even the faster version of my way! It was the winner in pa4080's benchmark as well, which covered more methods than this one but with smaller input, which is why I chose it, of all the other methods, to include in my test. If performance is important or files are huge, I think αғsнιη's cut method should be used.

This also serves as a reminder that the simple cut and paste utilities shouldn't be forgotten, and should perhaps be preferred when applicable, even though there are more sophisticated tools like grep that are often offered as first-line solutions (and that I am personally more accustomed to using).
perl - substitute the pattern /.*= / with an empty string //:

perl -pe 's/.*= //' input.file > output.file
perl -i.bak -pe 's/.*= //' input.file

From perl --help:

-e program        one line of program (several -e's allowed, omit programfile)
-p                assume loop like -n but print line also, like sed
-i[extension]     edit <> files in place (makes backup if extension supplied)
sed - substitute the pattern with an empty string:

sed 's/.*= //' input.file > output.file

or (but slower than the above):

sed -i.bak 's/.*= //' input.file

I mention this approach because it is a few times faster than those in Zanna's answer.
gawk - substitute the pattern ".*= " with an empty string "":

gawk '{gsub(".*= ", "");print}' input.file > output.file

From man gawk:

gsub(r, s [, t])    For each substring matching the regular expression r in the
                    string t, substitute the string s, and return the number of
                    substitutions. If t is not supplied, use $0...
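The return value the man page mentions (the number of substitutions) can be checked directly; this sketch prints it alongside each modified line, and uses plain awk, whose POSIX gsub behaves the same way as gawk's here:

```shell
# gsub returns how many replacements it made on the current record;
# here each line contains exactly one "...= " prefix, so n is 1 per line.
printf 'a = 1\nb = 2\n' | awk '{n = gsub(/.*= /, ""); print n, $0}'
# prints: 1 1
#         1 2
```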