Thursday, October 6, 2022

Is there a way to get the source columns from a column object in PySpark?

I'm writing a function that (hopefully) simplifies a complex operation for other users. As part of this, the user passes in some dataframes and an arbitrary boolean Column expression computed from those dataframes' columns, e.g. (F.col("first")*F.col("second").getItem(2) < F.col("third")) & (F.col("fourth").startswith("a")).
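For concreteness, the shape of the API I have in mind is roughly the following (the function name and signature are just placeholders):

    from pyspark.sql import DataFrame, Column, functions as F

    def join_on_condition(dfs: list[DataFrame], condition: Column) -> DataFrame:
        """Hypothetical entry point: join the given dataframes and
        apply the user's boolean condition as part of the operation."""
        ...

    condition = ((F.col("first") * F.col("second").getItem(2) < F.col("third"))
                 & (F.col("fourth").startswith("a")))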

The dataframes may have dozens of columns each, but I only need the result of this expression, so it should be more efficient to select only the relevant columns before the tables are joined. Is there a way, given an arbitrary Column, to extract the names of the source columns it is computed from, i.e. ["first", "second", "third", "fourth"]?
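The closest thing to a solution I can imagine is walking the Column's underlying Catalyst expression tree through py4j and collecting the UnresolvedAttribute leaves. This is only an untested sketch: it relies on the private _jc attribute and on Catalyst internals, so it could break between Spark versions.

    from pyspark.sql import Column

    def source_column_names(col: Column) -> list[str]:
        # Walk the Catalyst expression tree behind the Column and
        # collect the names of UnresolvedAttribute nodes. Relies on
        # the private _jc attribute and py4j, so it may break
        # between Spark versions.
        def walk(expr):
            if expr.getClass().getSimpleName() == "UnresolvedAttribute":
                yield expr.name()
            children = expr.children()  # a Scala Seq, indexed via py4j
            for i in range(children.size()):
                yield from walk(children.apply(i))
        return sorted(set(walk(col._jc.expr())))

On the example above, this should return something like ['first', 'fourth', 'second', 'third'], which I could then intersect with each dataframe's columns to decide what to select before the join.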

I'm using PySpark, so an ideal solution would be contained only in Python, but some sort of hack that requires Scala would also be interesting.

Alternatives I've considered are requiring the users to pass the names of the source columns separately, or simply joining the entire tables instead of selecting the relevant columns first. (I don't have a good understanding of Spark internals, so maybe the efficiency loss isn't as large as I think.) I might also be able to do something by cross-referencing the string representation of the column with the list of column names in each dataframe, but I suspect that approach would be unreliable.
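For reference, the string cross-referencing fallback would look something like this; the unreliability I suspect comes from the fact that a literal or a function name in the rendered expression can collide with a real column name:

    import re

    def referenced_columns(df, col):
        # Fragile fallback: treat every identifier-like token in the
        # Column's string form as a potential column reference and
        # intersect it with the dataframe's actual columns. A string
        # literal or function name that happens to match a column
        # name would produce a false positive.
        tokens = set(re.findall(r"[A-Za-z_]\w*", str(col)))
        return [c for c in df.columns if c in tokens]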




