
Improving Glint uploads


#1

Streaming

With the current API it is cumbersome to wrap the CSV dataset in a JSON structure, especially as quotes will need to be escaped. If we put the metadata in headers and treat the whole body exclusively as data, it becomes trivial to stream the file without needing to read the whole thing into memory.

On the server side we can use Go’s io.Copy() to copy from a Reader to a Writer. For the client, we can actually pass a File directly to fetch because it inherits from Blob which, according to the spec, fetch knows how to stream as the request body.

Neat trick I found while confirming this: time -v will give you the maximum resident set size of a process, from the resource usage that wait3(2) reports. At least on Linux. I had to specify the absolute path /usr/bin/time explicitly because Bash’s time is a shell keyword that shadows the binary and doesn’t take -v. Apparently its accuracy varies though, so to be sure I ran an 8GB file through the simple example of the above approach I built for the performance section, while watching top.

Progress indicator

Turns out this is easy enough and doesn’t require any support on the server. It just means using XMLHttpRequest so we can take advantage of ProgressEvent; upload progress is currently rather more fiddly with fetch.

Resume

It’d be nice to eventually be able to resume a partially complete upload, as one could conceivably want to use Glint from the field over slow or intermittent connections.

RFC 7233 seems to offer this for GET only, via Range. The most common approach for uploads seems to be extending the protocol with some custom metadata to delimit chunks, then stitching them back together on the server. tus.io is trying to become a standard for this, and there are a Go implementation and a JavaScript client. We’d of course still want to always support pure HTTP in the absence of tus headers, and it seems almost overbuilt, but it’s nice to know there’s a drop-in tool if we need to support this in a pinch.

Sorta hoped HTTP/2 would offer resumable uploads, considering it breaks data up into frames to enable its multiplexing and so is most of the way there, already doing the chunking. Alas, so far as I can tell, it does not.

Performance

Speaking of HTTP/2… I saw something alarming while trying to discover whether it could do resume. A StackOverflow post claims that browsers don’t adapt the flow-control window for large uploads, though they do for downloads. I couldn’t find any reference to this, nor any browser API for explicitly sending HTTP/2 frames like WINDOW_UPDATE. So I tested it.

I uploaded a roughly 2GB file from a ramdisk back to the same ramdisk via the client/server code below and compared unencrypted HTTP 1.1, HTTP 1.1 w/TLS, and HTTP/2 w/TLS in both Firefox and Chromium.

Firefox HTTP 1.1:
2140541808 bytes received in 1.9308 s at 1057.2922 MiB/s.
2140541808 bytes received in 1.9259 s at 1059.9530 MiB/s.
2140541808 bytes received in 2.0499 s at 995.8453 MiB/s.

Chromium HTTP 1.1:
2140541808 bytes received in 4.7987 s at 425.4035 MiB/s.
2140541808 bytes received in 4.8256 s at 423.0354 MiB/s.
2140541808 bytes received in 4.8614 s at 419.9197 MiB/s.


Firefox HTTP/2 TLS:
2140541808 bytes received in 14.3601 s at 142.1562 MiB/s.
2140541808 bytes received in 14.5017 s at 140.7685 MiB/s.
2140541808 bytes received in 14.4484 s at 141.2877 MiB/s.

Chromium HTTP/2 TLS:
2140541808 bytes received in 27.2463 s at 74.9233 MiB/s.
2140541808 bytes received in 27.4147 s at 74.4631 MiB/s.
2140541808 bytes received in 27.1450 s at 75.2028 MiB/s.


Firefox HTTP 1.1 TLS (GODEBUG=http2server=0):
2140541808 bytes received in 32.7694 s at 62.2953 MiB/s.
2140541808 bytes received in 32.8588 s at 62.1259 MiB/s.
2140541808 bytes received in 32.6510 s at 62.5213 MiB/s.

Chromium HTTP 1.1 TLS (GODEBUG=http2server=0):
2140541808 bytes received in 9.1313 s at 223.5585 MiB/s.
2140541808 bytes received in 8.8481 s at 230.7136 MiB/s.
2140541808 bytes received in 8.7701 s at 232.7658 MiB/s.

So that’s weird :slight_smile: To help narrow down whether this is flow-control related, I tried to test it explicitly. For some reason curl in Ubuntu 17.10 isn’t built with HTTP/2 support, so I grabbed nghttp2-client and set the window size to the 2^30-byte maximum like so:

nghttp -H':method: PUT' -d <filename> -w 30 https://localhost:3443/

And it’s slower, so I surmise there are far more variables at work here:

2140541808 bytes received in 34.0039 s at 60.0337 MiB/s.
2140541808 bytes received in 34.2830 s at 59.5449 MiB/s.
2140541808 bytes received in 34.2553 s at 59.5931 MiB/s.

Not sure any of that is of particular short-term interest beyond getting the API right, but it seems like TLS in general is more of a performance hit than any flow-control issues with HTTP/2. Besides, speaking of flow control: on the off chance we ever intend this to ingest massive datasets, TCP itself will prove to be a bottleneck if there’s much latency, since a single connection’s throughput is capped at roughly the window size divided by the round-trip time.

Source

fetch client

<html><body>
  <script>
    function doUpload() {
      const file = document.getElementsByTagName('input')[0].files[0];
      fetch('/', { method:'PUT', body:file }).then(res => console.log(res));
    }
  </script>
  <input type="file" />
  <button onclick="doUpload()">Upload</button>
</body></html>

XHR client

<html><body>
  <script>
    function doUpload() {
      const file = document.getElementsByTagName('input')[0].files[0];
      const xhr = new XMLHttpRequest();
      xhr.open('PUT', '/');
      xhr.upload.onprogress = e => console.log(e.loaded/e.total);
      xhr.send(file);
    }
  </script>
  <input type="file" />
  <button onclick="doUpload()">Upload</button>
</body></html>

server

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

func response(rw http.ResponseWriter, req *http.Request) {
	if req.Method == "GET" {
		file, err := os.Open("./fetch.html")
		// file, err := os.Open("./xhr.html")
		if err != nil {
			panic(err)
		}
		defer file.Close()
		io.Copy(rw, file)
	} else if req.Method == "PUT" {
		start := time.Now()
		file, err := os.Create("/tmp/httpout")
		if err != nil {
			panic(err)
		}
		defer file.Close()
		n, err := io.Copy(file, req.Body)
		if err != nil {
			panic(err)
		}

		elapsed := time.Since(start).Seconds()
		message := fmt.Sprintf("%d bytes received in %.4f s at %.4f MiB/s.\n",
			n, elapsed, (float64(n)/1024/1024)/elapsed)
		log.Print(message)
		rw.Write([]byte(message))
	}
}

func main() {
	http.HandleFunc("/", response)
	err := http.ListenAndServe(":3000", nil)
	// err := http.ListenAndServeTLS(":3443", "localhost.crt", "localhost.key", nil)
	log.Fatal(err)
}

#2

An interesting little HTTP/2 tidbit: you can start your response before the request is complete. So if you want to transform some data on the server, you don’t necessarily have to upload the whole file before downloading the transformed version; if your transformation can be streamed, you can download the transformed data while the input is still being uploaded. Not sure if this really plays into Glint’s core featureset, but thought I’d mention it. Maybe useful for something like computed columns based on other datasets, etc.