//
// The functionality of the standard library is becoming increasingly
// important in Zig. First of all, it is helpful to take a look at how
// the individual functions are implemented, because they serve
// wonderfully as templates for your own functions. In addition, these
// standard functions are part of the basic configuration of Zig.
//
// This means that they are always available on every system.
// Therefore it is also worthwhile to get to know them in Ziglings.
// It's a great way to learn important skills. For example, it is
// often necessary to process large amounts of data from files.
// And for this sequential reading and processing, Zig provides some
// useful functions, which we will take a closer look at in the coming
// exercises.
//
// A nice example of this has been published on the Zig homepage,
// replacing the somewhat dusty 'Hello world!'.
//
// Nothing against 'Hello world!', but it just doesn't do justice
// to the elegance of Zig, and that's a pity if someone takes a short
// first look at the homepage and doesn't get 'enchanted'. The present
// example is simply better suited for that, and we will therefore use
// it as an introduction to tokenizing, because it is wonderfully
// suited to understanding the basic principles.
//
// In the following exercises we will also read and process data from
// large files, and by then at the latest it will be clear to everyone
// how useful all this is.
//
// Let's start with the analysis of the example from the Zig homepage
// and explain the most important things.
//
//    const std = @import("std");
//
//    // Here a function from the standard library is brought into
//    // scope. It converts the numbers contained in a string into
//    // the corresponding integer values.
//    const parseInt = std.fmt.parseInt;
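//
//    // For example, parseInt(u32, "123", 10) parses the string in
//    // base 10; together with 'try' it yields the integer 123.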
//
//    // Defining a test case
//    test "parse integers" {
//
//        // Four numbers are passed in a string.
//        // Please note that the individual values are separated
//        // either by a space or a comma.
//        const input = "123 67 89,99";
//
//        // In order to be able to process the input values,
//        // memory is required. An allocator is defined here for
//        // this purpose.
//        const ally = std.testing.allocator;
//
//        // The allocator is used to initialize an ArrayList (a
//        // dynamically growing array) in which the numbers will
//        // be stored.
//        var list = std.ArrayList(u32).init(ally);
//
//        // The deferred call to deinit() frees the list's memory
//        // when the test ends. This way you can never forget the
//        // cleanup, and the compiler doesn't grumble either.
//        defer list.deinit();
//
//        // Now it gets exciting:
//        // A standard tokenizer is created (Zig has several). It
//        // splits the input at the respective separators (we
//        // remember: space and comma) and returns an iterator
//        // over the pieces in between.
//        var it = std.mem.tokenizeAny(u8, input, " ,");
//
//        // The iterator can now be processed in a loop and the
//        // individual numbers extracted one by one.
//        while (it.next()) |num| {
//            // But be careful: The numbers are still only available
//            // as strings. This is where the integer parser comes
//            // into play, converting them into real integer values.
//            const n = try parseInt(u32, num, 10);
//
//            // Finally the individual values are stored in the array.
//            try list.append(n);
//        }
//
//        // For the subsequent test, a second static array is created,
//        // which is directly filled with the expected values.
//        const expected = [_]u32{ 123, 67, 89, 99 };
//
//        // Now the numbers converted from the string can be compared
//        // with the expected ones, so that the test is completed
//        // successfully.
//        for (expected, list.items) |exp, actual| {
//            try std.testing.expectEqual(exp, actual);
//        }
//    }
//
// So much for the example from the homepage.
// Let's summarize the basic steps again:
//
// - We have a set of data in sequential order, separated from each other
//   by means of various characters.
//
// - For further processing, for example in an array, this data must be
//   read in, separated and, if necessary, converted into the target format.
//
// - We need a buffer that is large enough to hold the data.
//
// - This buffer can be created either statically at compile time, if the
//   amount of data is already known, or dynamically at runtime by using
//   a memory allocator (see the sketch after this list).
//
// - The data is split by the tokenizer at the respective separators
//   and stored in the reserved memory. This usually also includes
//   conversion into the target format.
//
// - Now the data can be conveniently processed further in the correct format.
//
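// As a rough, hedged sketch of these steps (the names 'line' and
// 'numbers' are made up purely for illustration, and the buffer size
// is assumed to be known in advance), reading into a statically sized
// buffer could look something like this:
//
//    const line = "1 2 3 4";
//    var numbers: [4]u32 = undefined; // size known at compile time
//    var i: usize = 0;
//    var tokens = std.mem.tokenizeScalar(u8, line, ' ');
//    while (tokens.next()) |token| : (i += 1) {
//        numbers[i] = try std.fmt.parseInt(u32, token, 10);
//    }
//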
// These steps are basically always the same.
// Whether the data is read from a file or entered by the user via the
// keyboard, for example, is irrelevant. Only the details differ, and
// that's why Zig has different tokenizers (a brief comparison is
// sketched below). But more about this in later exercises.
//
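// As a rough sketch (not part of this exercise), the variants differ
// mainly in how the delimiter is specified:
//
//    // splits at any single character out of a set:
//    var any = std.mem.tokenizeAny(u8, "a 1,b", " ,");
//
//    // splits at exactly one character:
//    var scalar = std.mem.tokenizeScalar(u8, "a 1 b", ' ');
//
//    // splits at a whole substring:
//    var sequence = std.mem.tokenizeSequence(u8, "a--1--b", "--");
//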
// Now we also want to write a small program to tokenize some data;
// after all, we need some practice. Suppose we want to count the words
// of this little poem:
//
// 	My name is Ozymandias, King of Kings;
// 	Look on my Works, ye Mighty, and despair!
// 	 by Percy Bysshe Shelley
//
//
const std = @import("std");
const print = std.debug.print;

pub fn main() !void {

    // our input
    const poem =
        \\My name is Ozymandias, King of Kings;
        \\Look on my Works, ye Mighty, and despair!
    ;

    // now the tokenizer, but what do we need here?
    var it = std.mem.tokenizeAny(u8, poem, ???);

    // print all words and count them
    var cnt: usize = 0;
    while (it.next()) |word| {
        cnt += 1;
        print("{s}\n", .{word});
    }

    // print the result
    print("This little poem has {d} words!\n", .{cnt});
}