개발/Data Engineering
[개발] spark에서 특정 조건의 최신 row들 가져오기
가을기_
2021. 6. 2. 13:01
Scala 코드 기준.
val simpleData = Seq(("James","Sales",3000),
("Michael","Sales",4600),
("Robert","Sales",4100),
("Maria","Finance",3000),
("Raman","Finance",3000),
("Scott","Finance",3300),
("Jen","Finance",3900),
("Jeff","Marketing",3000),
("Kumar","Marketing",2000)
)
import spark.implicits._
val df = simpleData.toDF("Name","Department","Salary")
df.show()
+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
| James| Sales| 3000|
| Michael| Sales| 4600|
| Robert| Sales| 4100|
| Maria| Finance| 3000|
| Raman| Finance| 3000|
| Scott| Finance| 3300|
| Jen| Finance| 3900|
| Jeff| Marketing| 3000|
| Kumar| Marketing| 2000|
+-------------+----------+------+
val w2 = Window.partitionBy("department").orderBy(col("salary"))
df.withColumn("row",row_number.over(w2))
.where($"row" === 1).drop("row")
.show()