Context: I have a C# API that sends an HTTP POST request to the /batches route of Livy, and Livy forwards the arguments to my Scala Spark driver. As far as I know, internally Livy runs the spark-submit command within the container of my Spark master.
The body of the request sent to Livy is as follows:
{
  "file": "/opt/scala-apps/spark-driver-assembly-1.0.0.jar",
  "proxyUser": "X",
  "className": "Y",
  "args": ["a gigantic json"],
  "name": "myTest",
  "conf": {
    "spark.sql.broadcastTimeout": "1500",
    "spark.driver.extraJavaOptions": "-Dlog4j.configuration=file:/opt/scala-apps/livy/conf/log4j.properties -Dguid=111 -Dtimestamp=20240424"
  }
}
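For reference, my Scala driver picks the payload up directly from the program arguments, roughly like this (a simplified sketch, not my actual code; the real class behind Y parses the JSON and runs the job):

// Simplified sketch of the driver entry point ("Y" in the request body above).
object Y {
  def main(args: Array[String]): Unit = {
    // Livy forwards "args" through spark-submit, so the gigantic JSON arrives as args(0).
    val payload = args(0)
    println(s"received ${payload.length} characters in args(0)")
    // ... parse the JSON and run the Spark job ...
  }
}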
Issue: The JSON passed to Livy in “args” is quite large (I didn’t include it here because it isn’t necessary). “args” is an array of strings, and I pass only one arg to the Livy route. When I make this request via Postman (to simulate what my API does), I receive the following error:
java.io.IOException: Cannot run program "/opt/spark/bin/spark-submit": error=7, Argument list too long
My interpretation is that the Livy server cannot internally run spark-submit with such a large argument.
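Since, as far as I know, Livy launches spark-submit as a child process, I assume the same failure could be reproduced entirely outside Livy. Here is a minimal sketch of what I mean (the argument size is arbitrary, and the ~128 KiB per-argument limit is something I read about Linux, not something I have confirmed):

import scala.sys.process._
import scala.util.{Failure, Success, Try}

object ReproArgLimit {
  def main(args: Array[String]): Unit = {
    // A single argument well beyond ~128 KiB; on Linux I expect exec to refuse it
    // with the same "error=7, Argument list too long".
    val hugeArg = "x" * 200000
    Try(Seq("/bin/echo", hugeArg).!) match {
      case Success(code) => println(s"echo ran with exit code $code")
      case Failure(e)    => println(s"exec failed: ${e.getMessage}")
    }
  }
}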
HOWEVER, I conducted a test by splitting this JSON into two strings, like this:
{
  "file": "/opt/scala-apps/spark-driver-assembly-1.0.0.jar",
  "proxyUser": "X",
  "className": "Y",
  "args": [
    "a gigantic json (part 1)",
    "a gigantic json (part 2)"
  ],
  "name": "myTest",
  "conf": {
    "spark.sql.broadcastTimeout": "1500",
    "spark.driver.extraJavaOptions": "-Dlog4j.configuration=file:/opt/scala-apps/livy/conf/log4j.properties -Dguid=111 -Dtimestamp=20240424"
  }
}
In other words, “args” now contains two strings. With this change, the error above does not occur.
With this information, I would like some guidance on how to proceed and a possible solution. Keep in mind that the workaround I tested (splitting the JSON into two strings in “args”) is not an acceptable solution for me; I don’t want to do this in my code, for a few reasons.
P.S.: I also have the impression that it might be some configuration of my container’s shell (or the container’s own operating system) that doesn’t accept such large arguments. Where can I find this information and what can I do?
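From what I’ve read, on Linux this would not be a shell setting but two kernel limits: ARG_MAX for the total size of the argument list plus the environment, and MAX_ARG_STRLEN (typically around 128 KiB) for each individual argument string, which would also explain why splitting the JSON into two strings avoids the error. Below is a small sketch (my own assumption, nothing Livy-specific, and it assumes getconf is available in the image) of how I would check the total limit from inside the Spark master container:

import scala.sys.process._

object CheckArgMax {
  def main(args: Array[String]): Unit = {
    // POSIX getconf reports the kernel's combined size limit for argv + environment.
    val argMax = Seq("getconf", "ARG_MAX").!!.trim
    println(s"ARG_MAX inside this container: $argMax bytes")
  }
}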
More info:
My Spark cluster is configured with 1 master and 2 workers (each on a different machine). I am running my application on a Spark Standalone cluster. Spark version is 3.5.
Here is my spark master conf file:
spark.master spark://myServer:7077
spark.sql.caseSensitive false
spark.executor.heartbeatInterval 90000
spark.network.timeout 400000
spark.executor.heartbeat.maxFailures 10
spark.shuffle.registration.timeout 500000
spark.shuffle.push.finalize.timeout 600s
spark.files.fetchTimeout 600s
spark.rpc.lookupTimeout 600s
spark.scheduler.excludeOnFailure.unschedulableTaskSetTimeout 600s
spark.eventLog.enabled true
spark.eventLog.dir file:/opt/spark/logs/spark-events/
spark.history.fs.logDirectory file:/opt/spark/logs/spark-events/
spark.executor.logsDirectory /opt/spark/logs
spark.sql.adaptive.enabled true
spark.sql.adaptive.skewJoin.enabled true
spark.sql.adaptive.localShuffleReader.enabled true
spark.sql.adaptive.coalescePartitions.enabled true
spark.executor.memory 5g
spark.executor.cores 2
spark.driver.memory 8g
spark.driver.cores 4
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.shuffleTracking.enabled true
spark.dynamicAllocation.executorIdleTimeout 600s
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.port 7078
spark.blockManager.port 7087
spark.driver.blockManager.port 7011
spark.shuffle.service.enabled true
spark.shuffle.service.port 7337
spark.submit.deployMode client
spark.worker.cleanup.enabled true
Here is my livy.conf file (I tried modifying the “header.size” values, but it did not help):
livy.spark.master = spark://X:7077
livy.spark.deploy-mode = client
# Configure Livy server http request and response header size.
#livy.server.request-header.size = 300000
#livy.server.response-header.size = 300000
livy.server.session.state-retain.sec = 600s
livy.cache-log.size = 1000000
livy.file.local-dir-whitelist = /opt/scala-apps
Here is my docker-compose.yml file:
version: "3.4"
services:
spark_master:
container_name: spark_master
image: apache/spark:3.5.0
stdin_open: true
tty: true
user: root
network_mode: host
environment:
- TZ=Asia/Baghdad
- SPARK_PUBLIC_DNS=X
restart: unless-stopped
volumes:
- volumes
ports:
- ports
entrypoint: y
spark_worker:
container_name: spark_worker
image: apache/spark:3.5.0
stdin_open: true
tty: true
user: root
network_mode: host
environment:
- TZ=Asia/Baghdad
- SPARK_MASTER_ADDRESS=spark://X:7077
- SPARK_WORKER_PORT=7087
- SPARK_PUBLIC_DNS=
restart: unless-stopped
volumes:
- volumes
ports:
- ports
entrypoint: z