What negative consequences can arise from this language design rule?

Clarification: the rule is meant to prevent accessing variables that are not declared yet.

Clarification 2: the rule requires the compiler to follow calls to functions that are defined in the same scope as the call when the call appears before the definition (forward referencing), which makes the compiler behave much like an interpreter.

I’m thinking about a fictional programming language that behaves similarly to JavaScript when declaring functions: that is, when the function keyword is used to create named functions, they get “hoisted” to the top of the scope and can be used anywhere their name is visible.

For example, the following is legal:

<code>f();
function f() { console.log("f called") };
</code>

However, there is a potential problem: what if the function f uses a variable x that is defined after the call, as here:

<code>f();
var x = 1;
function f() { console.log(x); }
</code>

In JavaScript this is also legal, because variables are hoisted as well; the program is essentially transformed into something like this:

<code>var x;
function f() { console.log(x); }

f();
x = 1;
</code>

And therefore just prints undefined.

In my fictional language, variables wouldn’t get hoisted, which means that the example before would produce an error. The rule concerns how this error is generated:

Functions are visible for the entire block in which they are defined,
but it is illegal to call a function that accesses a variable declared
after the call.

This means that each time the compiler for this language would encounter a function call, it would have to check if the function is defined in the current scope, and if it is, check if it accesses variables in the current scope that are declared after the call.

The reason for this rule is that it makes no sense to access variables before they are declared. Hoisting the variables to the top of their scopes like JavaScript does is a hack that only leaves room for error! This is why I believe it must not be done, but such situations should be detected and treated as errors (because they are, you’re referencing something that doesn’t chronologically exist yet).
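For comparison, JavaScript's `let` already treats this situation as an error, though only at run time rather than at compile time. The following is a minimal sketch of that behaviour, not of the proposed compiler check; the names `demo`, `outcome`, and `tdzOutcome` are made up for illustration.

```javascript
// A `let` binding exists for its whole scope but is unusable until its
// declaration has run, so a call that reads it too early throws.
function demo() {
  let outcome;
  try {
    f();                  // the call precedes the declaration of x
    outcome = "no error";
  } catch (err) {
    outcome = err.name;   // the engine reports the early access here
  }
  let x = 1;              // x only becomes usable from this point on
  function f() { return x; }
  return outcome;
}

const tdzOutcome = demo();
console.log(tdzOutcome);  // prints "ReferenceError"
```

The rule described above would reject the same program before it ever runs, which is what forces the compiler to look inside `f` at the call site.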

I’m looking for reasons why this is a bad rule, i.e. results in complications or inconsistencies.

The reason I suspect that the rule is a bad one is that it does not exist in the Scala language. Scala solved this problem with a much more general rule:

The scope of a name introduced by a declaration or definition is the
whole statement sequence containing the binding. However, there is a
restriction on forward references in blocks: In a statement sequence
s[1]...s[n] making up a block, if a simple name in s[i] refers to
an entity defined by s[j] where j >= i, then for all s[k]
between and including s[i] and s[j],

s[k] cannot be a variable definition.

If s[k] is a value definition, it must be lazy.

UPDATE:

For the time being I have adopted Scala’s approach.

This approach is a more general version of my rule: it catches all illegal accesses but also rejects some legal ones. That is a tradeoff I’m willing to make, since I think my rule may be too complicated or even impossible to implement.

Here’s the rephrased/explained rule from Scala:

Variables cannot be forward-referenced, but lazy values, functions defined with def (which are just syntactic sugar for lambdas assigned to (lazy) values; I might even remove them, as Doval suggested), classes, and modules can.

The restriction is that between the reference and the declaration (inclusive), there must be no variable definitions, because they might be altered by the entity being referenced before they are introduced to the program.
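The restriction can be sketched as a small checker. This is only an illustration under assumptions: each statement is modelled as an object with a `kind` ("var", "val", "lazy val", "def", or "expr"), the name it `defines`, and the names it `refs`. That shape and the function name `checkForwardReferences` are hypothetical, and each name is assumed to be defined at most once per block.

```javascript
// Reports every forward reference that crosses a variable definition or a
// non-lazy value definition, following the quoted rule.
function checkForwardReferences(stmts) {
  const definedAt = new Map();
  stmts.forEach((s, j) => { if (s.defines) definedAt.set(s.defines, j); });

  const errors = [];
  stmts.forEach((s, i) => {
    for (const name of s.refs) {
      const j = definedAt.get(name);
      if (j === undefined || j < i) continue;   // not a forward reference
      for (let k = i; k <= j; k++) {            // s[i] .. s[j], inclusive
        const kind = stmts[k].kind;
        if (kind === "var" || kind === "val") {
          errors.push(`statement ${i} refers forward to '${name}' across ${kind} '${stmts[k].defines}'`);
          break;
        }
      }
    }
  });
  return errors;
}

// The example from the question: f(); var x = 1; def f uses x
const questionExample = [
  { kind: "expr", defines: null, refs: ["f"] },
  { kind: "var",  defines: "x",  refs: [] },
  { kind: "def",  defines: "f",  refs: ["x"] },
];
// The same call with no variable definition between call and definition
const reordered = [
  { kind: "expr", defines: null, refs: ["f"] },
  { kind: "def",  defines: "f",  refs: [] },
  { kind: "var",  defines: "x",  refs: [] },
];

console.log(checkForwardReferences(questionExample));
console.log(checkForwardReferences(reordered));
```

Note that the check never looks inside the body of `f` at the call site; it only inspects what lies between the reference and the definition, which is why it is simpler than my original rule and also stricter.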

You’ve already created at least one inconsistency, and a particularly damning one at that – treating functions differently from other values. A function declaration is conceptually the same as binding a variable to an anonymous function, but you’re asserting that “regular” variables don’t get hoisted while function declarations do. So…

<code>f()
function f() { /* ... */ }
</code>

Works, but…

<code>f()
var f = function() { /* ... */ } // Pretend this is anonymous function syntax
</code>

Doesn’t. (And if you don’t allow passing functions around like any other value, your language is horribly crippled.) Next, what happens if a function is redefined?

<code>f()
function f() { /* definition 1 */ }
...
function f() { /* definition 2 */ }
</code>

If you treat this as an error, you’re once again rejecting functions as values. There’s no reason you shouldn’t be able to do:

<code>var f = function() { /* def 1 */ }
f = function() { /* def 2 */ }
</code>

So let’s assume you’ll allow the function to be redefined. How do you deal with lexical scoping?

<code>function f() { /* definition 1 */ }
function bar() {
    f()
    function f() { /* definition 2 */ }
}
</code>

If you hoist definition 2 to the top of bar's scope, you end up with this:

<code>function f() { /* definition 1 */ }
function bar() {
    function f() { /* definition 2 */ }
    f()
}
</code>

Which means you can never call the outer definition from a nested scope unless you give the inner function a different name. Moreover, in any other language, the meaning of the above program is independent of the choice of name for definition 2. The hoisting rule changes the meaning depending on whether there’s a name collision or not. On the other hand if you don’t hoist definition 2 to the top of bar, you’ve broken your own rule.
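JavaScript itself shows this name-dependence. A minimal sketch (the return strings and the names `barRenamed` and `g` are just for illustration):

```javascript
function f() { return "definition 1"; }

function bar() {
  const result = f();                      // resolves to the inner, hoisted f
  function f() { return "definition 2"; }
  return result;
}

function barRenamed() {
  const result = f();                      // no inner f, so the outer one is called
  function g() { return "definition 2"; }
  return result;
}

const calledFromBar = bar();
const calledFromBarRenamed = barRenamed();
console.log(calledFromBar);         // prints "definition 2"
console.log(calledFromBarRenamed);  // prints "definition 1"
```

The two functions differ only in the name of the inner definition, yet the call before it reaches a different function.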

Weighing the ‘fictional consequences’ against the ‘fictional benefits’ means you are looking for the downsides such a rule would have if you were designing a language.

The biggest downside I see is that anytime you don’t make your language do something “useful” (generating an error falls under this class), you weaken the power of the language. You are missing an opportunity to do anything useful with it.

You could treat both as defined throughout the same scope, as you say JavaScript does. Something has to come first, the vars or the functions; if you want your functions to be able to make use of the variables in the lexical scope of their definition, then it makes perfect sense to treat those variables as though they’d been declared before the first call to “f”.

That’s useful for having local functions:

<code>func bigfunc() {...
  prologue... setup .. parameter checking...
</code>

Now:

<code>  for loop over the sanitized params (p1, p2, p3...) {

    func a_case () {...will need access to prologue and setup vars}

    func b_case () {... ditto....}

    ....
    func z_case() {}
</code>

Take the input and lowercase it, then take its first letter, prepend it to ‘_case’, and call the function of that name for further processing.

<code>  endloop
} #end func
</code>

So the functions in this case are helper functions that “want” or “need” access to the setup information created by bigfunc, but have no reason to be seen outside of bigfunc’s scope.

So that’s one possible feature you miss out on: the ability to have nested functions where the inner ones are only callable by code in ‘bigfunc’.
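In JavaScript the same shape might look roughly like this. It is only a sketch: the helper bodies, the `seen` list, and the sample inputs are invented for illustration, and the helpers are declared once before the loop rather than inside it.

```javascript
function bigfunc(rawInputs) {
  // prologue / setup: state that the helpers need access to
  const seen = [];
  const sanitized = rawInputs.map((s) => String(s).trim());

  // helpers that are invisible outside bigfunc but can use its variables
  function a_case(input) { seen.push("a:" + input); }
  function b_case(input) { seen.push("b:" + input); }
  const helpers = { a_case, b_case };

  for (const input of sanitized) {
    // lowercase the input, take the first letter, prepend it to "_case"
    const key = input.toLowerCase().charAt(0) + "_case";
    const helper = helpers[key];
    if (helper) helper(input);
    else seen.push("unhandled:" + input);
  }
  return seen;
}

const dispatched = bigfunc([" Apple", "banana ", "Cherry"]);
console.log(dispatched);  // [ 'a:Apple', 'b:banana', 'unhandled:Cherry' ]
```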

Another option would be for each invocation of bigfunc to get its own private copy of the value the var had when it was defined.

I.e. suppose bigfunc is a “function factory” that returns anonymous functions bound to the “specific” value the var ‘x’ had when the factory was invoked.

So we have…

<code>func bigproducerfunc(parameters: object_instance) {

  var object_instance;  #maybe a filename?

  // nested functions....

  function size(...return size of object instance...)

  function last_mod{,owner,destroy...} {
    // do operation on value of 'x' that was passed in
    // 'object_instance' when the function was initially called.
  }

  function check_mod_time {... checks last modification time
      on the var it was called with)
  }

  var myhash funcs_by_name = (add(), size, last_mod()...);

  return refto(myhash).
}
</code>

Now a program can pass in the name of a file that it wants to operate on. The program doesn’t need to know about OS differences; those can be handled in each accessor func. To access them:

<code>var hashref = bigproducerfunc("/etc/passwd")

var href2   = bigproducerfunc("/another pathname/")....
</code>

“$hashref->size” will use the value that was passed in when the set of function references was created and returned. Each time you call ‘bigproducerfunc’, it can create a list of functions that will operate on that one object and only it (no chance of misspellings, etc.).

<code>printf("Working name %s size = %s", $hashref->size);
</code>

In this case, you can use that setup for creating a closed ‘environment’ function that will only work on the file you passed in at creation.

For each file you want to work with, ‘bigfunc’ is called with a file arg, and hands back a pointer to a set of functions that will operate only on the filename they were created with. It is considered a “closed system” with respect to the initial data.

This is sometimes called a ‘closure’: a function or set of functions that operate on closed data given to a ‘function factory’ or ‘method factory’ (bigfunc in our example), which packs up the data with the functions so they can later be used without respecifying the original data each time, while also guaranteeing which data the functions work with.
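A minimal JavaScript sketch of such a factory follows. Everything in it is assumed for illustration: the name `makeFileTools`, the `name` and `size` accessors, and the in-memory `knownSizes` table that stands in for real file-system calls.

```javascript
// Stand-in for the file system, so the sketch runs anywhere.
const knownSizes = new Map([
  ["/etc/passwd", 2048],
  ["/tmp/notes", 12],
]);

// The factory: every function it returns is closed over `path`.
function makeFileTools(path) {
  function name() { return path; }
  function size() { return knownSizes.has(path) ? knownSizes.get(path) : 0; }
  return { name, size };
}

const passwdTools = makeFileTools("/etc/passwd");
const notesTools = makeFileTools("/tmp/notes");

// Neither call repeats the path; each set of tools remembers its own.
const report = `Working name ${passwdTools.name()} size = ${passwdTools.size()}`;
console.log(report);             // Working name /etc/passwd size = 2048
console.log(notesTools.size());  // 12
```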

===============================================

Those are two different kinds of features you would be giving up by turning this into a ‘dead-semantic’ or ‘dead-syntax’ case that is flagged as an “error” rather than doing something useful with it. The above are two common examples of doing something with this use case; I’m sure others can be created.

Dead-semantics and dead-syntax are “dead-space” in a language.

Take “perl” for example. Perl created a dead spot that would be
harmless to fix, but likely won’t be due to language ossification.

The syntax “string” x n (e.g. “hi ” x 5) will create some number of repetitions of that string concatenated together.

Suppose you wanted to do that some multiple of 5 times, so you pass
in a multiplier:

<code>"string" x 3*$x;  # error
</code>

This yields an error in perl because ‘x’ has the same precedence as ‘*’, so the expression parses as ("string" x 3) * $x. You can therefore never use this without cluttering the code with parentheses:

<code>"string" x (3*$x);  #ok
</code>

Now there is no reason to require that — in the operator precedence table, x is at the same precedence as mult+divide, but it really shouldn’t be considered the same because it is less flexible (it isn’t associative and can’t be mixed with mult+divide).

Since strings generally don’t combine with numbers, creating a
precedence for ‘x’ (repetitions op) lower than math operators, would
make sense. In the same way, the string concatenation feature should be lower than “addition + subtraction”, so, while this works:

<code>> perl -MP -we'use strict;
my $a="nblah" . 4*5;   #concatenate 'blah w/result of 4*5'
P "%s", $a;'

blah20
</code>

This:

<code>> perl -MP -we'use strict;
my $a="nblah" . 4+5;   #try to concat blah w/result of 4+5..(fail)
P "%s", $a;'

Argument "blah4" isn't numeric in addition (+) at -e line 2.
5
</code>

Does not.

In both cases, dead semantics were designed into the language.

Not ideal for good use of language semantics. It unnecessarily creates errors where perfectly good meaning can be unambiguously derived.

Hope this gives you some insight into what makes for good language design. (I must note, that knowing what works and doesn’t is always easier to see in hindsight ;-)).

The term “before it is declared” is not really well defined. For example, the following is fine in Haskell:

<code>let
   f x = x + y
   y = something else
in f 42
</code>

What I want to say is: Just because “f” appears earlier in the code when you read it top down, does not mean “it is defined before”. On the contrary, in Haskell and related languages all items in the same declaration group are considered defined (or bound) at the same time.

To answer your question, the negative consequence for the users of your fictional language is that, if you enforce “order of definition” in an arbitrary way, they must order the definitions manually. It also follows that either mutually recursive definitions are not possible, or you handle things with function types differently from things with other types, without good reason.
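The mutual-recursion point can be seen in a short JavaScript sketch (the functions `isEven` and `isOdd` are illustrative). Whichever of the two is written first has to mention the other before it appears, so a strict define-before-use rule would have to reject one of them or special-case functions:

```javascript
// The call precedes both definitions, and each definition mentions the other.
const tenIsEven = isEven(10);

function isEven(n) { return n === 0 ? true : isOdd(n - 1); }
function isOdd(n) { return n === 0 ? false : isEven(n - 1); }

console.log(tenIsEven);  // prints true
```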

This sounds much like the original C language (plus a whole bunch of other languages from the period).

You had to define variables, functions, etc. before you could reference them, “before” meaning strictly “in a previous line of source”.

This is why you see the “main()” function at the back of most C programs and not at the front where it logically belongs.

The language designers did not think this was a great idea! They did it to save the compiler from making two passes through the source code, and so to compile programs in a reasonable time on the hardware available.

And no, it’s not really a good idea; it just annoys programmers that they have to define stuff where they would rather not, in order to satisfy some picky syntax.

In a language which supports nested scope, what should be the effect of:

<code>int foo()
{
  int x,y;
  x=1; y=0;
  while(y < 5)
  {
    x=4;
    y++;
    int x;
    x+=23;
  }
  return x;
}
</code>

If, within the inner scope, the usage of x had followed the declaration at that scope level, the meaning of the code would have been clear.

If one abides by the principle that a perfect programming language should refuse to compile anything that wouldn’t behave as the programmer intended, and if one considers it reasonably likely the programmer intended the x=4 to modify the outer-scope x, but also reasonably likely that the programmer intended it to modify the inner one, that would imply that any compiled behavior would be likely to behave in a fashion contrary to programmer intention.

I would suggest that in languages where the ordering of function definitions in the source file has no relationship to the sequence in which they will be called, it would make sense to regard the file as holding an unordered mapping of function signatures to function definitions, along with a bunch of other stuff. As such, if ordering is irrelevant, the concept of “hoisting” is too. For things where ordering is relevant, I would suggest that declaring an identifier within a scope should cause outer declarations of that identifier to become unusable at the start of the scope, but the new declaration should not be usable until after it occurs. Additionally, I would define a syntax for undeclaring an outer-scope identifier within a scope, and warn if code fails to use it [a warning rather than an error, so as to make the definition of a new outer-scope identifier not be a breaking change].
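JavaScript's `let` behaves much like the first half of that suggestion, which makes it a convenient way to try the idea out. In the sketch below (the names `shadowDemo`, `outcome`, and `outerX` are hypothetical), the inner declaration makes the outer `x` unreachable from the start of the block, while the inner `x` stays unusable until its declaration has run:

```javascript
function shadowDemo() {
  let x = 1;
  let outcome;
  try {
    // Within this block `x` names the inner binding from the first line on,
    // so this assignment reaches neither the outer x nor a usable inner x.
    x = 4;
    let x = 0;
    x += 23;
    outcome = "no error";
  } catch (err) {
    outcome = err.name;
  }
  return { outcome, outerX: x };
}

const shadowResult = shadowDemo();
console.log(shadowResult);  // { outcome: 'ReferenceError', outerX: 1 }
```

The difference from the proposal is that JavaScript reports the problem only when the statement executes, whereas the suggestion above is for the compiler to refuse the code outright.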

For some purposes, it would be useful to have an explicit way of declaring “temporary” variables with the following semantics:

Temporary variable names may be reused within a scope, but may not appear in the same scope as an “ordinary” variable of the same name.
Use of a temporary variable name within a scope will render any temporary variable of the same name unusable in any outer scope.
Temporary variables must have their value assigned only at the point of declaration.

The basic idea here is to recognize the common situation where a “variable” will, within a certain area of the code, always hold a value that was written in one particular place. Given a construct like:

<code>int result;

...

result = someMethod();