25 October 2015

Awk

I'll create a delimited file from an agriculture sector* document as a sample for this entry.

This entry will look at awk, a pattern-directed scanning and processing language.

The AWK Programming Language
B.W. Kernighan, P.J. Weinberger,
A.V. Aho, © 1988 Bell Telephone
Laboratories, Incorporated

UNIX Awk: http://www.cs.princeton.edu/~bwk/btl.mirror/index.html
GNU Awk: https://www.gnu.org/software/gawk/manual/
Plan 9: http://man.cat-v.org/plan_9/1/awk

Navn: Institut Français des Productions Cidricoles
Adresse: La Rangée Chesnel, 61500 Sées, Frankrig
Telefon: +33 2 33 27 56 70
Web adresse: http://www.ifpc.eu/

* Varietes_cidricoles.pdf (POMME A CIDRE)

pdftotext version 0.20.2
GNU sed version 4.2.1 (has the -i argument)

sh-4.2$ pdftotext -f 6 -l 6 -layout Varietes_cidricoles.pdf
sh-4.2$ sed -i '12, 16d; s/3 DOUCE/ DOUCE/g; s/2 DOUCE/ DOUCE /g' Varietes_cidricoles.txt
sh-4.2$ head -n 67 Varietes_cidricoles.txt | tail -n +11 \
> | cut -f 41- -d " " > output
sh-4.2$

The resulting file named 'output' is a three column width file, modify the file so
it is single column as appears at the base of this Internet log (blog) post.

Add delimiters to the records excepting the uppercase and blank lines.

sh-4.2$ sed -i 's/ (/:(/g; s/) /):/g; s/[0-9]$/&::/g; s/[a-z]$/&::::/g' output
sh-4.2$

Add data to the fourth fields excepting the uppercase and blank lines.

sh-4.2$ sed -i '2, 15s/::$/:amere:/; 18, 38s/::$/:douce amere:/; 41, 50s/::$/:douce:/' output
sh-4.2$ sed -i '53, 60s/::$/:acidulee:/; 63s/::$/:aigre amere:/; 66, 82s/::$/:aigre:/' output
sh-4.2$

Obtain the names of the registered cultivars.

sh-4.2$ pdftotext Varietes_cidricoles.pdf - | grep ® | sort -u
Chanteline®
Chanteline® 1
Cidor®
Judaine®
Judaine® 1
Judeline®
Judeline® 1
Juliana®
sh-4.2$

Determine if these are also present in our text file.

sh-4.2$ for cultivar in Chanteline Cidor Judaine Judeline Juliana
> do
> grep $cultivar output
> done
Cidor:(4.02 / 25.9):1055:amere:
Judaine:(0.75 / 100.4):1053:aigre:
Judeline:(0.67 / 75.42):1050:acidulee:
Juliana:(1.78 / 154):1061:aigre:
sh-4.2$

Then enter the registered mark as unicode chars in the appropriate records.

sh-4.2$ cat script
4s/$/&®/
55s/$/&®/
72s/$/&®/
75s/$/&®/
sh-4.2$ file script
script: UTF-8 Unicode text
sh-4.2$ sed -i -f script output
sh-4.2$

Delete the uppercase and blank lines.

sh-4.2$ sed -i '1d; 16, 17d; 39, 40d; 51, 52d; 61, 62d; 64, 65d' output
sh-4.2$

Sort the file alphabetically and redirect stdout to a new file.

sh-4.2$ export LC_ALL=fr_FR.UTF-8
sh-4.2$ sort output > varietes_cidricoles
sh-4.2$ exit

The resulting file which will be utilised for exercises with awk.

sh-4.2$ awk -F: '/aigre amere/' varietes_cidricoles
Cazo Jaune:(3.28 / 196.2):1054:aigre amere:
sh-4.2$

sh-4.2$ awk -F: 'NR == 18 { print $1$5 " is a bitter apple." }' varietes_cidricoles
Cidor® is a bitter apple.
sh-4.2$

sh-4.2$ awk -F: 'NR == 23 { gsub("oe", "oë", $1); print }' varietes_cidricoles
Douce Coëtligné (1.83 / 29.09) 1051 douce
sh-4.2$

sh-4.2$ awk -F: 'NR==39 || NR==58 { gsub(/^\(/, "", $2); gsub(/\)$/, "", $2); split($2,a," / "); \
> print $1$5 " has " a[1] " g/l polyphenols and " a[2] " meq/l acidity." }' varietes_cidricoles
Judeline® has 0.67 g/l polyphenols and 75.42 meq/l acidity.
Petit Jaune has 1.2 g/l polyphenols and 110.59 meq/l acidity.
sh-4.2$

Create a CRLF .csv file for use in an MS® spreadsheet.

sh-4.2$ tr ':' ',' < varietes_cidricoles | awk '{ printf "%s\r\n",  $0 }' > varietes_cidricoles.csv
sh-4.2$

One can manually delete in notepad.exe the empty line at the end of the .csv file.

sh-4.2$ cat output
AMERE
C'Huero Briz (4.67 / 30.65) 1056
Chevalier Jaune (3.79 / 33.09) 1053
Cidor (4.02 / 25.9) 1055
Domaines (3.6 / 31.81) 1067
Doux Joseph (3.61 / 31.04) 1058
Fréquin Rouge Petit (5.06 / 36.35) 1065
Jeanne Renard (4.23 / 25.35) 1065
Kermerrien (4.32 / 22.17) 1062
Marie Ménard (4.82 / 32.78) 1061
Marin Onfroy (3.25 / 25.5) 1059
Meriennet
Mettais (3.78 / 25.61) 1063
Petit Amer (4.43 / 29.65) 1055
Tardive de la Sarthe (3.68 / 29.51) 1059

DOUCE AMERE
Antoinette (2.73 / 36.79) 1050
Argile Rouge Bruyère
Bedan (2.27 / 22.04) 1056
Binet Blanc (2.52 / 26.5) 1060
Binet Rouge (2.41 / 26.54) 1063
Binet Violet (2.62 / 24.93) 1050
Bisquet (2.12 / 31.31) 1045
C'Huero Ru
Cartigny (2.21 / 31.26) 1051
Clos Renaux (2.2 / 36.91) 1052
Clozette Douce (2.32 / 33.25) 1054
Douce Moen (2.43 / 31.85) 1061
Doux Lozon (2.13 / 21.21) 1053
Gros Bois (2.86 / 42.15) 1050
Gros Oeillet
Moulin à Vent (2.7 / 38.96) 1061
Muscadet de Dieppe (2.54 / 30.95) 1055
Noël des Champs (2.33 / 22.6) 1050
Omont (2.2 / 29) 1063
Peau de Chien (2.95 / 32.25) 1065
Saint Martin (2.27 / 30.09) 1055

DOUCE
Belle Fille de la Manche
Douce Coetligné (1.83 / 29.09) 1051
Doux au Gober
Doux Evêque Jaune (1.96 / 24.25) 1052
Doux Normandie (1.42 / 24.56) 1065
Doux Veret de Carrouges (1.77 / 24.62) 1055
Muscadet Petit de l'Orne (1.9 / 24.23) 1061
Queue Torte
Rouge Duret (1.65 / 25.3) 1049
Rousse de la Sarthe (1.64 / 25.22) 1056

ACIDULEE
Blanchet (1.28 / 71.4) 1050
Guillevic (1.35 / 86.29) 1059
Judeline (0.67 / 75.42) 1050
Locard Blanc
Pomme de Moi
Rouget de Dol Gros
Sebin Blanc (1.44 / 81.2) 1051
Tesnière (1.56 / 83.97) 1055

AIGRE AMERE
Cazo Jaune (3.28 / 196.2) 1054

AIGRE
Armagnac (1.7 / 121.21) 1065
Avrolles (0.87 / 178.99) 1055
Blanc Sur (1.23 / 99.25) 1055
Diot Roux (1.85 / 146.48) 1052
Gesnot (1.11 / 97) 1049
Jaune de Vitré (1.61 / 132.44) 1060
Judaine (0.75 / 100.4) 1053
Judin (0.89 / 98.24) 1060
Judor (0.73 / 100.63) 1052
Juliana (1.78 / 154) 1061
Jurella (0.61 / 126.7) 1053
Locard Vert
Petit Jaune (1.2 / 110.59) 1055
Pomme de Bouet
Rambault (1.4 / 95.76) 1061
Rénao
René Martin (1.38 / 113.31) 1053
sh-4.2$