Parsing Numeric Constants

Warning! Some information on this page is older than 6 years now. I keep it for reference, but it probably doesn't reflect my current knowledge and beliefs.

Fri 16 Dec 2011

As a personal project I started coding a scripting language. The first thing I want to do is parse integer and floating-point numeric constants. My decision about which syntax to support is based on the C++ language, with some modifications.

An integer constant in C++ can be written as:

123    Decimal       Starting with non-zero digit
0x7B   Hexadecimal   Starting with "0x"
0173   Octal         Starting with "0"

It can also be suffixed with "u" for unsigned type and "l" for long or "ll" for "long long".

"l" makes no sense in Visual C++ because the "long" type is the same as plain "int": it has 32 bits, even in 64-bit code. So I'd prefer to use "long" as the type and "l" as the suffix for 64-bit numbers.

I also don't like the octal form. First, I can't see any use for it. In all of computer science I've seen only one situation where the octal system is used: file permissions in Unix. I haven't seen a single use of the octal form in C++ code. Second, I think padding a number with leading zeros shouldn't change its meaning, so the choice of "0" as the prefix for octal (instead of, for example, "0o") is very unfortunate in my opinion.
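To make the leading-zero problem concrete, here is a tiny sketch in Python (used only for illustration, it is not the language I'm writing) that reads the same digits the way C++ does and the way a person probably expects:

```python
# The same digits, read under C++'s rule (leading "0" means octal)
# and as a plain decimal number padded with a zero.
text = "0173"

as_octal = int(text, 8)     # how C++ interprets the literal 0173
as_decimal = int(text, 10)  # what the zero-padded digits look like

print(as_octal, as_decimal)  # prints: 123 173
```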

It would be much more useful if we could write binary numbers in code. Java 7 introduces such a syntax with the "0b" prefix. It also has another interesting feature I like: it allows underscores in numeric literals, so you can make long constants more readable by grouping digits, like "0b0011_1010".

I'd like to support decimal, hexadecimal and binary numbers in my language. Regular expressions that match these are:

[0-9][0-9_]*[Uu]?[Ll]?
0[Xx][0-9A-Fa-f_]+[Uu]?[Ll]?
0[Bb][01_]+[Uu]?[Ll]?
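As a rough sketch of how these three expressions could be used, the Python snippet below classifies a string and computes its value. The function name parse_integer, the returned tuple and the capture groups added to the patterns are my assumptions for illustration only. It also rejects inputs such as "0x_" that the expressions accept but that contain no actual digits.

```python
import re

# The three integer patterns from the text, each paired with its numeric base.
# Capture groups are added to separate the digits from the suffixes.
INTEGER_PATTERNS = [
    (16, re.compile(r"0[Xx]([0-9A-Fa-f_]+)([Uu]?)([Ll]?)")),
    (2, re.compile(r"0[Bb]([01_]+)([Uu]?)([Ll]?)")),
    (10, re.compile(r"([0-9][0-9_]*)([Uu]?)([Ll]?)")),
]


def parse_integer(text):
    """Return (value, is_unsigned, is_long), or None if text is not an integer constant."""
    for base, pattern in INTEGER_PATTERNS:
        match = pattern.fullmatch(text)
        if match is None:
            continue
        # Underscores only group digits; they carry no value.
        digits = match.group(1).replace("_", "")
        if not digits:
            return None  # only underscores after the prefix, e.g. "0x_"
        return int(digits, base), bool(match.group(2)), bool(match.group(3))
    return None


print(parse_integer("0b0011_1010"))  # prints: (58, False, False)
print(parse_integer("0x7B"))         # prints: (123, False, False)
print(parse_integer("0173"))         # prints: (173, False, False)
```

Note that "0173" comes out as decimal 173 here, because leading zeros no longer switch to octal.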

Floating-point numbers are more sophisticated. A constant that uses all possible features might look like this:

111.222e-3f

The question is which parts are required and which are optional. It may seem that floating-point numbers and their representation in code are something obvious, but there actually are subtle differences between programming languages. "111" is obviously an integer constant, but is a dot with no digits on the left, a dot with no digits on the right, an exponent part, or an "f" suffix alone enough to form a proper floating-point constant?

111.222   C++: OK      HLSL: OK      C#: OK
111.      C++: OK      HLSL: OK      C#: Error
.222      C++: OK      HLSL: OK      C#: OK
111e3     C++: OK      HLSL: OK      C#: OK
111f      C++: Error   HLSL: Error   C#: OK

I want to support all these options, so regular expressions that match floating-point constants in my language are:

[0-9]+[Ff]
[0-9]+([eE][+-]?[0-9]+)[Ff]?
[0-9]+\.[0-9]*([eE][+-]?[0-9]+)?[Ff]?
\.[0-9]+([eE][+-]?[0-9]+)?[Ff]?
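A quick way to check these expressions against the examples above is to join them into one alternation and test whole strings. The sketch below does that in Python; the names is_float_constant and parse_float are hypothetical, and the conversion simply strips the "f" suffix and relies on Python's float(), which is double precision, so it says nothing about how a real implementation would store single-precision values.

```python
import re

# The four floating-point patterns from the text, joined into one alternation.
FLOAT_PATTERN = re.compile(
    r"[0-9]+[Ff]"
    r"|[0-9]+([eE][+-]?[0-9]+)[Ff]?"
    r"|[0-9]+\.[0-9]*([eE][+-]?[0-9]+)?[Ff]?"
    r"|\.[0-9]+([eE][+-]?[0-9]+)?[Ff]?"
)


def is_float_constant(text):
    """True if the whole string matches one of the four patterns."""
    return FLOAT_PATTERN.fullmatch(text) is not None


def parse_float(text):
    """Return the value as a Python float, or None if text is not a float constant."""
    if not is_float_constant(text):
        return None
    # Drop the optional "f" suffix; float() handles "111." and ".222" as well.
    return float(text.rstrip("Ff"))


for sample in ["111.222", "111.", ".222", "111e3", "111f", "111"]:
    print(sample, is_float_constant(sample))
```

All samples except the last one are accepted; "111" is rejected because it is an integer constant.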
