Amazon Transcribe gotcha

I needed to transcribe some minutes from a meeting, and only one person was speaking during a particular three-minute stretch. So I copied that segment out to its own MP3 file.

I uploaded the file to S3 and ran a default transcription job. Whoops.

By default, I mean that I mostly clicked Next, Next, Next. I supplied a job name, an input file, and an output file. (Since I used an output file location other than the default, these weren't exactly the default settings.)

Because I had not specified the number of speakers, the finished transcription job left the 'speaker_labels' data out of the JSON file.

I have been using https://github.com/trhr/aws-transcribe-transcript/transcript.py to simplify the JSON into text, but it does not handle files with missing speaker labels.

Sigh. Now I have to re-do the transcription, which will incur another charge. Those speaker_labels are all over the file when present.
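A quick check on the downloaded JSON would have told me this before I ran transcript.py. Here is a minimal sketch, assuming the usual Transcribe output layout with a top-level "results" object; the function names are my own and are not part of transcript.py. Since the clip has only one speaker, the plain text under "results" / "transcripts" may be enough on its own.

```python
import json
from pathlib import Path


def load_transcript(path: str) -> dict:
    """Read a Transcribe output file into a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def has_speaker_labels(transcript: dict) -> bool:
    """Return True if the output contains the 'speaker_labels' section."""
    return "speaker_labels" in transcript.get("results", {})


def plain_text(transcript: dict) -> str:
    """Join the full-text transcripts, ignoring speaker information.

    Assumption: the text lives under results -> transcripts -> transcript.
    """
    parts = transcript.get("results", {}).get("transcripts", [])
    return "\n".join(part.get("transcript", "") for part in parts)


# Example usage (the path is a placeholder):
# data = load_transcript("review_of_previous_board_meeting.json")
# if not has_speaker_labels(data):
#     print(plain_text(data))
```

If has_speaker_labels returns False, the job ran without speaker identification.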

For what it is worth, the tasks were essentially:

  1. Upload the file to S3
    1. aws s3 cp /home/david/Documents/some_path/review_of_previous_board_meeting.mp3 s3://some_s3_bucket/
  2. Log in to Amazon Transcribe and create a job
    1. Job name was review_of_previous_board_meeting
    2. Input file was s3://some_s3_bucket/review_of_previous_board_meeting.mp3
    3. Output file was s3://some_s3_bucket/review_of_previous_board_meeting.json
      1. This did require clicking the button “Customer specified S3 bucket”
      2. I use the AWS CLI to copy files between my local machine and the S3 bucket, so it is easier if I name the bucket I want the files in.
    4. Click Next
    5. THE IMPORTANT PIECE: Audio Identification = On, and audio identification type = speaker identification
      1. Stupidly, you have to specify the number of speakers, and 1 is not accepted as a value. So I had to tell it there were two speakers, even though I had clipped the MP3 file to contain only one.
  3. Download the file from S3
    1. aws s3 cp s3://some_s3_bucket/review_of_previous_board_meeting.json /home/david/Documents/some_path/
  4. Clean up the transcription
    1. transcript.py /home/david/Documents/some_path/review_of_previous_board_meeting.json
    2. This time transcript.py runs without errors. The result is the file review_of_previous_board_meeting.json.text
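
The console settings from step 2 can probably be scripted as well. What follows is a minimal sketch, not something I ran for this post: it assumes boto3 is installed and configured, and it assumes en-US audio (I never set a language above). build_job_request is a helper name I made up; the bucket and file names are the placeholders from the steps.

```python
def build_job_request(job_name: str, bucket: str, audio_key: str,
                      max_speakers: int = 2) -> dict:
    """Build the arguments for start_transcription_job with speaker labels on."""
    if max_speakers < 2:
        # Mirrors the console behaviour described above: 1 is rejected.
        raise ValueError("max_speakers must be at least 2")
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": "en-US",  # assumption: not stated in the steps above
        "MediaFormat": "mp3",
        "Media": {"MediaFileUri": f"s3://{bucket}/{audio_key}"},
        "OutputBucketName": bucket,
        "OutputKey": f"{job_name}.json",
        "Settings": {
            "ShowSpeakerLabels": True,  # the important piece
            "MaxSpeakerLabels": max_speakers,
        },
    }


def start_job(request: dict) -> dict:
    """Submit the job. Imported lazily so the builder works without boto3."""
    import boto3

    client = boto3.client("transcribe")
    return client.start_transcription_job(**request)


# Example usage (placeholders from the steps above):
# request = build_job_request(
#     "review_of_previous_board_meeting",
#     "some_s3_bucket",
#     "review_of_previous_board_meeting.mp3",
# )
# start_job(request)
```

Keeping the request builder separate from the call means the settings can be inspected without touching AWS or incurring a charge.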
