With this example:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [{'name': 'Alice', 'age': 1}, {'name': 'Casper', 'age': 2}, {'name': 'Agatha', 'age': 3}]
df = spark.createDataFrame(data)
df_1 = df.select("name", df["age"].alias("q"))
1.- If I do the join in the following way, there is no problem:
df.join(df_1, "name").select("name", df.age).show()
2.- But if I do exactly the same join with the two DataFrames swapped, it raises an error saying that column “age” is ambiguous:
df_1.join(df, "name").select("name", df.age).show()
AnalysisException: Column age#0L are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via `Dataset.as` before joining them, and specify the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.
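For reference, the workaround the exception suggests would look roughly like this in PySpark (DataFrame.alias playing the role of Dataset.as; the alias names "a" and "b" are arbitrary), and as far as I can tell it avoids the error:

from pyspark.sql.functions import col

# Alias both sides and refer to the column by qualified name instead of df.age
df_1.alias("a").join(df.alias("b"), "name").select("name", col("b.age")).show()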
In both cases the join is essentially the same, and the result has only one column named “age”:
df.join(df_1, "name").show()
+------+---+---+
| name|age| q|
+------+---+---+
|Agatha| 3| 3|
| Alice| 1| 1|
|Casper| 2| 2|
+------+---+---+
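To double-check this, I can print the column lists of both joined DataFrames (I would expect both joins themselves to build fine, since only the later select fails):

# Both joined DataFrames should contain exactly one 'age' column
print(df.join(df_1, "name").columns)
print(df_1.join(df, "name").columns)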
Why does the select in the second case throw an ambiguous-column error if there is only one column named “age”?
And why does it throw the error only in the second case?