2024-03-24

Running the RDS Data API with the AWS CLI

AWS PostgreSQL

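To verify the Aurora PostgreSQL Serverless v2 cluster created in the previous post, I ran the RDS Data API through the AWS CLI.

Environment

AWS CLI v2.15.30 and jq v1.7.1. The shell script was run in Git Bash from Git for Windows v2.42.0.windows.1.

Sample code

Below is the shell script. Using the query API would probably make it cleaner, but that was more effort than I wanted, so the values are extracted with jq.

Any IAM user that could run the RDS query editor in the previous post can run it.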

#!/bin/bash

DATABASE_NAME=aurora_serverless_example
STACK_NAME=BackendStack

# The DB cluster ARN and the secret ARN are both required
AURORA_CLUSTER_ARN=$(aws rds describe-db-clusters |
  jq -r ".DBClusters[] | select (.DatabaseName | contains(\"${DATABASE_NAME}\")) | .DBClusterArn"
)

# The secret exposes nothing better to filter on, so match on Description
AURORA_SECRET_ARN=$(aws secretsmanager list-secrets |
  jq -r ".SecretList[] | select (.Description | contains(\"Generated by the CDK for stack: ${STACK_NAME}\")) | .ARN"
)

aws rds-data execute-statement \
  --database "${DATABASE_NAME}" \
  --resource-arn "${AURORA_CLUSTER_ARN}" \
  --secret-arn "${AURORA_SECRET_ARN}" \
  --format-records-as 'JSON' \
  --sql 'SELECT user_id, user_name FROM example.users ORDER BY user_id;'

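About execute-statement

aws rds-data execute-statement runs a SQL statement. To target a region other than the default, pass --region <region name>.

Three options are required: --resource-arn <DB cluster ARN>, --secret-arn <ARN of the secret holding the DB credentials>, and --sql <SQL to run>.

--database is the database to connect to, and --format-records-as selects the response format. There are two formats, NONE (the default) and JSON; NONE omits column names from the result, so I specify JSON.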
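For reference, the same options map onto keyword arguments of boto3's rds-data client. The sketch below only builds the argument dictionary so that it runs without AWS access; the helper name build_execute_statement_params and the placeholder ARNs are assumptions made for illustration and are not part of the original script.

```python
def build_execute_statement_params(resource_arn, secret_arn, sql,
                                   database=None, format_records_as="JSON"):
    """Build keyword arguments for the rds-data ExecuteStatement call.

    resource_arn, secret_arn and sql correspond to the three required CLI
    options; database and format_records_as are optional.
    """
    params = {
        "resourceArn": resource_arn,
        "secretArn": secret_arn,
        "sql": sql,
        "formatRecordsAs": format_records_as,
    }
    if database is not None:
        params["database"] = database
    return params


# Placeholder ARNs (hypothetical values for illustration only).
params = build_execute_statement_params(
    resource_arn="arn:aws:rds:ap-northeast-1:123456789012:cluster:example",
    secret_arn="arn:aws:secretsmanager:ap-northeast-1:123456789012:secret:example",
    sql="SELECT user_id, user_name FROM example.users ORDER BY user_id;",
    database="aurora_serverless_example",
)
print(sorted(params))
```

With credentials configured, the dictionary could then be passed as boto3.client("rds-data").execute_statement(**params).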

The --schema option is also declared, but it cannot be used at the moment.

$ aws rds-data execute-statement help | grep schema
   [--schema <value>]
"--schema" (string)
   The name of the database schema.
   Note: Currently, the "schema" parameter isn't supported.

Specifying it anyway produced the following error.

An error occurred (ValidationException) when calling the ExecuteStatement operation: The schema parameter isn't supported.

Results

The following is returned. The rows come back as a string in formattedRecords.

{"numberOfRecordsUpdated":0,"formattedRecords":"[{\"user_id\":1,\"user_name\":\"user1\"},{\"user_id\":2,\"user_name\":\"user2\"},{\"user_id\":3,\"user_name\":\"user3\"}]"}

Without --format-records-as, the following is returned instead. It is JSON, but the column names are missing.

{"records":[[{"longValue":1},{"stringValue":"user1"}],[{"longValue":2},{"stringValue":"user2"}],[{"longValue":3},{"stringValue":"user3"}]],"numberOfRecordsUpdated":0}

Both are slightly awkward, but getting a string back is easier to work with than losing the column names.
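As a rough illustration of the difference, the sketch below decodes both responses shown above into the same list of dictionaries. The function names are made up for this example, and the column names for the default format have to be supplied by hand because the response does not carry them. NULL columns (returned as {"isNull": true}) are not handled here.

```python
import json

# Responses copied from the two executions shown above.
JSON_RESPONSE = r'{"numberOfRecordsUpdated":0,"formattedRecords":"[{\"user_id\":1,\"user_name\":\"user1\"},{\"user_id\":2,\"user_name\":\"user2\"},{\"user_id\":3,\"user_name\":\"user3\"}]"}'
DEFAULT_RESPONSE = '{"records":[[{"longValue":1},{"stringValue":"user1"}],[{"longValue":2},{"stringValue":"user2"}],[{"longValue":3},{"stringValue":"user3"}]],"numberOfRecordsUpdated":0}'


def rows_from_formatted(response_text):
    """Decode a --format-records-as JSON response into a list of dicts."""
    response = json.loads(response_text)
    # formattedRecords is itself a JSON document stored as a string,
    # so it needs a second json.loads.
    return json.loads(response["formattedRecords"])


def rows_from_records(response_text, column_names):
    """Decode the default response, supplying the column names by hand."""
    response = json.loads(response_text)
    rows = []
    for record in response["records"]:
        # Each field is a one-entry dict such as {"longValue": 1}.
        values = [next(iter(field.values())) for field in record]
        rows.append(dict(zip(column_names, values)))
    return rows


formatted_rows = rows_from_formatted(JSON_RESPONSE)
default_rows = rows_from_records(DEFAULT_RESPONSE, ["user_id", "user_name"])
print(formatted_rows == default_rows)
```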

Retrospective

Between the missing schema option and the so-so result format, the API still feels rough around the edges.

Schema support alone would be very welcome.

2024-03-16

Setting up Aurora PostgreSQL Serverless v2 and the RDS Data API with AWS CDK

AWS PostgreSQL

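The RDS Data API is now available on the PostgreSQL-compatible edition of Aurora Serverless v2.

It is not fully compatible with Serverless v1 (the minimum ACU cannot be set to 0, for example), but I happened to have a requirement for an RDB that can be integrated with Salesforce and other systems through an HTTP API, so I decided to evaluate it.

Using the RDS Data API requires some preparation, such as registering credentials in Secrets Manager, but building the environment with AWS CDK took care of that automatically, so here are my notes.

Environment
Initializing the AWS CDK project
Writing the Stack
Deploying
Verification
Retrospective

Environment

AWS CDK Toolkit v2.133.0 and AWS CLI v2.15.30.

Initializing the AWS CDK project

Installing AWS CDK itself is omitted. The following commands initialize the project.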

mkdir -p aurora-serverless-example/backend
cd aurora-serverless-example/backend

cdk init app --language typescript

git add package-lock.json
git commit -m 'add package-lock.json'

Writing the Stack

I wrote the following in the generated aurora-serverless-example/backend/lib/backend-stack.ts.

import { Stack, StackProps } from 'aws-cdk-lib'
import { IpAddresses, SubnetType, Vpc } from 'aws-cdk-lib/aws-ec2'
import { AccessKey, User } from 'aws-cdk-lib/aws-iam'
import {
  AuroraPostgresEngineVersion,
  ClusterInstance,
  DatabaseCluster,
  DatabaseClusterEngine,
} from 'aws-cdk-lib/aws-rds'
import { StringParameter } from 'aws-cdk-lib/aws-ssm'

import { Construct } from 'constructs'

export class BackendStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props)

    // VPC
    const vpc = new Vpc(this, 'Vpc', {
      vpcName: 'AuroraServerlessExample',
      ipAddresses: IpAddresses.cidr('172.30.0.0/16'),
      maxAzs: 2,
      subnetConfiguration: [
        {
          cidrMask: 24,
          subnetType: SubnetType.PRIVATE_ISOLATED,
          name: 'PrivateIsolated',
        },
      ],
    })

    // RDS
    const databaseCluster = new DatabaseCluster(this, 'Aurora', {
      engine: DatabaseClusterEngine.auroraPostgres({
        version: AuroraPostgresEngineVersion.VER_15_5,
      }),
      serverlessV2MinCapacity: 0.5,
      serverlessV2MaxCapacity: 1,
      writer: ClusterInstance.serverlessV2('writer'),
      readers: undefined,
      enableDataApi: true,
      iamAuthentication: true,
      storageEncrypted: true,
      defaultDatabaseName: 'aurora_serverless_example',
      vpc,
      vpcSubnets: vpc.selectSubnets({
        subnetType: SubnetType.PRIVATE_ISOLATED,
      }),
    })

    // IAM User
    const iamUser = new User(this, 'User')
    databaseCluster.grantDataApiAccess(iamUser)
    databaseCluster.secret?.grantRead(iamUser)

    // AccessKey
    const accessKey = new AccessKey(this, 'AccessKey', {
      user: iamUser,
    })

    const ssmAccessKeyId = 'AccessKeyId'
    new StringParameter(this, ssmAccessKeyId, {
      parameterName: ssmAccessKeyId,
      stringValue: accessKey.accessKeyId,
    })

    const ssmSecretAccessKey = 'SecretAccessKey'
    new StringParameter(this, ssmSecretAccessKey, {
      parameterName: ssmSecretAccessKey,
      stringValue: accessKey.secretAccessKey.unsafeUnwrap(),
    })
  }
}

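This is a throwaway setup for evaluation, so much of it is rough, but here is a walkthrough.

Creating the VPC

Perhaps because the RDS Data API is called through an endpoint, the PRIVATE_ISOLATED subnet type was sufficient.

Creating the DatabaseCluster fails unless at least two AZs are specified, so I use the minimum of 2. (In the Tokyo region, when subnetConfiguration contains only one entry, there appear to be two AZs even without maxAzs.)

Creating the DatabaseCluster

For evaluation, both the minimum and maximum ACU are set to their lowest values and no reader instance is configured. The RDS Data API appears to run through the writer instance regardless of the kind of SQL being executed.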

docs.aws.amazon.com

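Setting enableDataApi to true enables the RDS Data API. As the link above and this page describe, running the RDS Data API requires the database user's credentials to be stored in Secrets Manager; with CDK, the credentials specified in the credentials property of DatabaseCluster's third argument are stored in Secrets Manager automatically.

The code above omits credentials, so the user name is the default, postgres. To use a different user name, write something like the following.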

import { Credentials } from 'aws-cdk-lib/aws-rds'

new DatabaseCluster(this, 'Aurora', {
  // other properties are the same as in the stack above
  credentials: Credentials.fromGeneratedSecret('userName'),
})

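iamAuthentication is also set to true; this property controls whether IAM database authentication is enabled. Calls to the RDS Data API must be signed with Signature Version 4, so an IAM user is needed.

Creating the IAM user

Create an IAM user with the permissions needed to run the RDS Data API. Since this is an evaluation, the permissions are attached directly to the IAM user.

The documentation on authorizing access to the RDS Data API states:

The AWS managed policy AmazonRDSDataFullAccess includes permissions for the Data API.

However, even with the AmazonRDSDataFullAccess policy attached as below, calling the RDS Data API fails because the user lacks read permission on the Secrets Manager secret.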

import { ManagedPolicy, User } from 'aws-cdk-lib/aws-iam'

new User(this, 'User', {
  managedPolicies: [
    ManagedPolicy.fromAwsManagedPolicyName('AmazonRDSDataFullAccess'),
  ],
})

The cause is that AmazonRDSDataFullAccess only allows reading secrets whose ARN matches arn:aws:secretsmanager:*:*:secret:rds-db-credentials/*. The secret created from the DatabaseCluster credentials has an ARN of the form arn:aws:secretsmanager:*:*:secret:<DatabaseCluster ID>Secret<8 random characters>, which does not match.
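The mismatch can be sketched with Python's fnmatch, used here as an approximation of IAM's wildcard matching rather than the real policy evaluator. Both ARNs below are hypothetical: the account ID, names, and suffixes are made up for illustration.

```python
from fnmatch import fnmatchcase

# Resource pattern quoted above from AmazonRDSDataFullAccess.
POLICY_PATTERN = "arn:aws:secretsmanager:*:*:secret:rds-db-credentials/*"

# Hypothetical ARN shaped like a CDK-generated secret.
CDK_SECRET_ARN = (
    "arn:aws:secretsmanager:ap-northeast-1:123456789012:"
    "secret:AuroraSecret1A2B3C4D-abcdef"
)
# Hypothetical ARN that uses the rds-db-credentials/ prefix.
PREFIXED_SECRET_ARN = (
    "arn:aws:secretsmanager:ap-northeast-1:123456789012:"
    "secret:rds-db-credentials/cluster-EXAMPLE/postgres-abcdef"
)


def allowed_by_policy(secret_arn):
    """Return True if the ARN matches the managed policy's resource pattern."""
    return fnmatchcase(secret_arn, POLICY_PATTERN)


print(allowed_by_policy(CDK_SECRET_ARN))       # False
print(allowed_by_policy(PREFIXED_SECRET_ARN))  # True
```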

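The secret created from credentials is available as databaseCluster.secret, so I granted the IAM user permission to run the RDS Data API against the DB cluster with databaseCluster.grantDataApiAccess(iamUser) and read permission on the secret with databaseCluster.secret?.grantRead(iamUser).

Creating the access key

Create the access key used for AWS Signature Version 4 and store the AccessKeyId and SecretAccessKey.

The SecretAccessKey should really go into Secrets Manager, but since this is an evaluation it is stored in Systems Manager alongside the AccessKeyId.

Deploying

Log in with the AWS CLI and deploy with cdk synth and cdk deploy. Deploying to the Tokyo region took just under 15 minutes.

Verification

The RDS query editor executes SQL through the RDS Data API, so it can be used for verification (assuming the IAM identity has permission to run the RDS Data API and to read the secret).

Look up the ARN of the created secret, for example in cdk.out/BackendStack.template.json or in Secrets Manager in the management console.

Open RDS in the AWS management console and click "Query Editor" in the left pane. Start the query editor with the following steps.

Select the DB cluster that was created
Under "Database username", choose "Connect with a Secrets Manager ARN"
In "Secrets manager ARN", enter the secret ARN you looked up
In "Enter the name of the database", enter the database name. Here it is aurora_serverless_example.

Use the following SQL to confirm that DDL and DML can be executed.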

CREATE SCHEMA example;

CREATE TABLE example.users (
 user_id BIGSERIAL NOT NULL,
 user_name VARCHAR(50) NOT NULL,
 PRIMARY KEY (user_id)
);

INSERT INTO example.users (user_name) VALUES ('user1');
INSERT INTO example.users (user_name) VALUES ('user2');
INSERT INTO example.users (user_name) VALUES ('user3');

SELECT user_id, user_name FROM example.users ORDER BY user_id;

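Note that the query editor seems to run each statement as a separate request: inserting SET search_path TO example; and the like has no effect, so every statement names the schema explicitly.

Retrospective

This was my first time with CDK, so I am not sure this is the right way to do it. Still, conveniences like the automatic secret setup are undeniable.

Compared with hand-writing CloudFormation in the old days, it has become far easier.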

2024-02-11

Applying for access to Azure OpenAI Service

Azure

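I was asked to evaluate Azure OpenAI On Your Data on Microsoft Azure.

Azure OpenAI Service requires an application in advance, so these are my notes on the application process as of 2024/2/6.

Preparation
Application contents
Application result
Retrospective

Preparation

The application form as of 2024/2/6 is here.

Access is granted per subscription. The form asks for the Subscription ID of the subscription to enable, so look it up beforehand under All services > Subscriptions.

An email address is also required; addresses such as gmail.com and hotmail.com cannot be used, and a company address appears to be required.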

Your Company Email Address - Applications submitted with a personal email address (e.g. gmail.com, hotmail.com, outlook.com, etc.) will be DENIED.

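Application contents

These are the items I got stuck on. Note that the application was approved with the name, company name, company address, and so on entered in Japanese.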

14, 15: If you have a contact at Microsoft, ...

Question 14 is "If you have a contact at Microsoft, please provide their full name." and question 15 is "If you have a contact at Microsoft, please provide their email address." I was not sure what to enter, so I left both blank.

17: Which Azure OpenAI service feature(s) are you requesting access for

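Question 17, "Which Azure OpenAI service feature(s) are you requesting access for?", offers the following choices, and multiple selections are allowed. The evaluation only needs the GPT-3.5 family, so I checked only the first one.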

GPT-3.5, GPT-3.5 Turbo, GPT-4, GPT-4 Turbo, and/or Embeddings Models (Conversational AI, Search, Summarization, Writing Assistance or content generation, Code-based scenarios, Reason over Structured and Unstructured data)
DALL-E 2 and/or DALL-E 3 models (text to image)
OpenAI Whisper model (Speech-to-Text)
GPT-4 Turbo with Vision

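18: Use cases

Selecting anything other than Whisper in question 17 brings up question 18, which asks about use cases.

With GPT-3.5... selected, the choices are as follows. I selected 1, 2, 6, and 8.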

Chat and conversation interaction: Users can interact with a conversational agent that responds with responses drawn from trusted documents such as internal company documentation or tech support documentation; conversations must be limited to answering scoped questions. Available to internal, authenticated external users, and unauthenticated external users.
Chat and conversation creation: Users can create a conversational agent that responds with responses drawn from trusted documents such as internal company documentation or tech support documentation; conversations must be limited to answering scoped questions. Limited to internal users only.
Code generation or transformation scenarios: For example, converting one programming language to another, generating docstrings for functions, converting natural language to SQL. Limited to internal and authenticated external users.
Journalistic content: For use to create new journalistic content or to rewrite journalistic content submitted by the user as a writing aid for pre-defined topics. Users cannot use the application as a general content creation tool for all topics. May not be used to generate content for political campaigns. Limited to internal users.
Most Valuable Professional (MVP) or Regional Director (RD) Demo Use: Any applicant who is not in the Microsoft Most Valuable Professional (MVP) Award Program and in the MVP database, or in the Regional Director (RD) Program, will be denied if this use case is selected. For use by a current participant in the MVP or RD Program (the name entered in Questions 1-2 must be the name of the MVP or RD participant) solely to develop, test, and demonstrate one or more sample applications showcasing the Azure OpenAI Service GPT-3.5, GPT-3.5 Turbo, GPT-4, GPT-4 Turbo, and/or Embeddings Models capability (in accordance with a use case listed in this Question [X]). No production use, sale, or other disposition of an application is permitted under this use case; if an MVP, RD, or their employer wants to use an Azure OpenAI Service application in production, a separate form must be submitted, the appropriate use case must be selected, and a separate eligibility determination will be made.
Question-answering: Users can ask questions and receive answers from trusted source documents such as internal company documentation. The application does not generate answers ungrounded in trusted source documentation. Available to internal, authenticated external users, and unauthenticated external users.
Reason over structured and unstructured data: Users can analyze inputs using classification, sentiment analysis of text, or entity extraction. Examples include analyzing product feedback sentiment, analyzing support calls and transcripts, and refining text-based search with embeddings. Limited to internal and authenticated external users.
Search: Users can search trusted source documents such as internal company documentation. The application does not generate results ungrounded in trusted source documentation. Available to internal, authenticated external users, and unauthenticated external users.
Summarization: Users can submit content to be summarized for pre-defined topics built into the application and cannot use the application as an open-ended summarizer. Examples include summarization of internal company documentation, call center transcripts, technical reports, and product reviews. Limited to internal, authenticated external users, and unauthenticated external users.
Writing assistance on specific topics: Users can create new content or rewrite content submitted by the user as a writing aid for business content or pre-defined topics. Users can only rewrite or create content for specific business purposes or pre-defined topics and cannot use the application as a general content creation tool for all topics. Examples of business content include proposals and reports. May not be selected to generate journalistic content (for journalistic use, select the above Journalistic content use case). Limited to internal users and authenticated external users.
Data generation for fine-tuning: Users can use a model in Azure OpenAI to generate data which is used solely to fine-tune (i) another Azure OpenAI model, using the fine-tuning capabilities of Azure OpenAI, and/or (ii) another Azure AI custom model, using the fine-tuning capabilities of the Azure AI service. Generating data and fine-tuning models is limited to internal users only; the fine-tuned model may only be used for inferencing in the applicable Azure AI service and, for Azure OpenAI service, only for customer’s permitted use case(s) under this form.

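Items after the use cases

The item after the use cases (19 in this case) is a checkbox confirming the terms of use.

Item 20 is a consent checkbox acknowledging that, when content is flagged by the content filter, Microsoft employees may review it to debug or investigate misuse.

Item 21 is a survey and can be left blank.

Application result

Clicking the "Submit" button at the bottom left shows the following message. It says processing usually finishes within about a day.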

Most applications are processed within 24 hours. Some applications may require additional processing time and take up to 10 business days

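As the message said, the next day I received an email saying the application had been approved, with the subject "Welcome to the Azure OpenAI Service, <name>! [ApplicationID <7-digit number>]".

Retrospective

All the questions are in English, so understanding the use cases took some effort.

The rest feels like an ordinary access request, but as of 2024/2 it does not seem possible to apply without a company email address, so trying it as an individual looks difficult.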

2023-12-08

Storybook v7 fails with ERR_REQUIRE_ESM when using Yarn v1

Storybook Node.js

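Storybook, which had worked until yesterday, suddenly stopped working after yarn install, so here are my notes.

Environment

Node.js v20.10.0, Yarn 1.22.19, Storybook v7.5.3.

Problem

In a Next.js v13 project, I use Storybook to check components and the like.

This time, while evaluating a new package, I ran yarn add, yarn remove, and yarn install repeatedly.

After that, starting Storybook with yarn storybook produced the following error and it would no longer start.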

🔴 Error: It looks like you are having a known issue with package hoisting.
Please check the following issue for details and solutions: https://github.com/storybookjs/storybook/issues/22431#issuecomment-1630086092


<workspace>/node_modules/cli-table3/src/utils.js:1
const stringWidth = require('string-width');
          ^

Error [ERR_REQUIRE_ESM]: require() of ES Module <workspace>/node_modules/string-width/index.js from <workspace>/node_modules/cli-table3/src/utils.js not supported.
Instead change the require of index.js in <workspace>/node_modules/cli-table3/src/utils.js to a dynamic import() which is available in all CommonJS modules.
  at Object.<anonymous> (<workspace>/node_modules/cli-table3/src/utils.js:1:21) {
 code: 'ERR_REQUIRE_ESM'
}

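npm run storybook produces the same error.

Fix

The error message contains a link to a GitHub issue, so I checked it.

github.com

That comment describes the background, the cause, and the fix.

The background seems to be as follows. (I only skimmed it, so I may be wrong.)

Because of a change in cliui, jackspeak v2.1.2 and later were updated to use @isaacs/cliui v8.0.2, a fork that does not support Yarn v1
glob, which Storybook uses, declares its dependency as jackspeak@^2.0.3, which resolves to jackspeak v2.1.2 or later, so the error occurs under Yarn v1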
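To see why ^2.0.3 admits both the broken and the working releases, here is a simplified sketch of caret-range matching. It is an assumption-laden approximation (plain x.y.z versions with a major version of 1 or higher, no pre-release tags), not the actual semver implementation that Yarn uses.

```python
def parse(version):
    """Split a plain x.y.z version string into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def satisfies_caret(version, base):
    """Return True if version satisfies ^base, for a base with major >= 1."""
    v, b = parse(version), parse(base)
    if b[0] == 0:
        raise ValueError("this sketch only handles major versions >= 1")
    # ^2.0.3 means >=2.0.3 and <3.0.0: same major, not older than the base.
    return v[0] == b[0] and v >= b


for candidate in ["2.0.2", "2.1.1", "2.1.2", "3.0.0"]:
    print(candidate, satisfies_caret(candidate, "2.0.3"))
```

Both 2.1.1 and 2.1.2 fall inside the range, so without a pin the resolver is free to pick a release from 2.1.2 onward, while 2.1.1 remains a valid choice for glob.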

The recommended fundamental fix is to upgrade Yarn to v3, but to stay on Yarn v1, add resolutions to package.json and pin jackspeak to version 2.1.1.

{
  ...
  "resolutions": {
    "jackspeak": "2.1.1"
  }
}

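After adding the above to package.json and re-running yarn install, the error on Storybook startup no longer occurred.

Retrospective

I panicked when it suddenly stopped working, but having the URL in the error message was a great help.

I have been saying "I really should migrate off Yarn v1" for about two years now; with problems like this cropping up, it is high time I actually did.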

2023-10-12

"GLIBC not found" occurs when using Node.js v18 on AWS Amplify

Amplify Node.js Next.js

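For a project I started recently, I updated the development environment to Node.js v18 + Next.js v13. I deployed it to AWS Amplify for evaluation and hit an error.

A quick search suggests this is a common pitfall; plenty of write-ups already exist, but here are my notes on the fix.

Environment

Node.js v18.16.1, Next.js v13.4.19.

Problem

I hosted a Next.js application with the versions above on AWS Amplify. The build settings were as follows.

Build image: Amazon Linux:2 (default)
Live package updates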
1. Next.js version: 13.4.19
2. Node.js version: 18.16.1

Deploying in this state fails the application build with the following error under "Build" > "Frontend".

[WARNING]: node: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by node)
           node: /lib64/libc.so.6: version `GLIBC_2.28' not found (required by node)  
[ERROR]: !!! Build failed

Note that Amplify already supports Next.js v13.

aws.amazon.com

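Cause

Searching for "node GLIBC not found" turns up similar cases. The cause appears to be that the GLIBC (GNU C Library) installed in the OS is too old.

The container image was the default Amazon Linux 2. The same problem reportedly occurs on CentOS 7, which Amazon Linux 2 appears to be based on.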
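The version check behind the error can be sketched as follows. The required versions come from the build log above; the installed version 2.26 is used as the glibc version of Amazon Linux 2, which is an assumption of this sketch rather than something stated in the log.

```python
import platform


def parse(version):
    """Split a dotted version string such as "2.28" into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def missing_glibc_versions(installed, required):
    """Return the required versions that are newer than the installed glibc."""
    return [r for r in required if parse(r) > parse(installed)]


# Symbol versions named in the build log above.
REQUIRED = ["2.27", "2.28"]

# 2.26 stands in for the glibc of Amazon Linux 2 (assumption of this sketch).
print(missing_glibc_versions("2.26", REQUIRED))

# platform.libc_ver() reports the C library of the running interpreter;
# it returns empty strings where the library cannot be determined.
lib, version = platform.libc_ver()
if lib == "glibc" and version:
    print(version, missing_glibc_versions(version, REQUIRED))
```

Running the last part inside a candidate build image is one way to check whether that image would satisfy the Node.js binary before changing the Amplify settings.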

it.ama2pro.net

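Fix

Changing the container image solves it. The official Node.js image is available in the ECR Public Gallery.

Since I use v18.16.1, I set the build image URL to public.ecr.aws/docker/library/node:18.16.1.

I expected the build to pass, but this time the error [ERROR]: !!! Node version not available: 18.16.1 occurred.

Suspecting that specifying the Node.js version under "Live package updates" was the problem, I removed Node.js version, and the error went away.

Retrospective

Until this problem came up, I had not paid attention to Amplify's application runtime environment. When a suitable container image exists, it may be better to specify it.

I thought support for Amazon Linux 2 ended at the end of June 2023, but it turns out it was extended by two years. It was officially released in June 2018; at that age, it is no surprise that some things are starting not to run.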

2023-07-09

Speaker diarization (speaker identification) on Google Colab from Whisper API transcription results

Python OpenAI

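I transcribe phone calls with Whisper and have the OpenAI API's gpt-3.5-turbo model summarize them.

Summaries of things like meeting minutes can be made highly accurate with the right prompt, but phone calls do not take the form of written text and the conversation tends to wander, so accuracy is low.

In particular, Whisper's transcription does not indicate which party is speaking, so what the two sides said gets merged in the summary, and there have been many requests for improvement.

gpt-4 had just become generally available on the OpenAI API, so I tried it; accuracy improved without touching the input text or the prompt, but at 20 times the cost of gpt-3.5-turbo it did not get budget approval.

To improve accuracy, I looked into whether speaker diarization (also called speaker identification, speaker separation, or speaker segmentation) was possible, considering a move away from Whisper if necessary. I managed to do something along those lines on Google Colab based on Whisper's output, so here are my notes.

Investigating approaches
Choosing an approach
Implementing speaker diarization with spkrec-ecapa-voxceleb on Google Colaboratory
Retrospective

Investigating approaches

Searching turned up examples using Python. The names that come up are spkrec-ecapa-voxceleb, Pyannote, and WhisperX.

spkrec-ecapa-voxceleb

spkrec-ecapa-voxceleb is published by SpeechBrain. It reportedly contains the tools needed to perform speaker verification with a pretrained ECAPA-TDNN model.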

huggingface.co

VoxCeleb is apparently a dataset built from YouTube. ECAPA-TDNN seems to be the model.

Pyannote

For audio, as in this use case, the package to use is pyannote.audio. There were others as well, such as pyannote.video for video.

github.com

It is described as a PyTorch-based tool for speaker diarization.

As the README notes, an access token is required to use it, and issuing one requires a HuggingFace account.

WhisperX

Transcription plus speaker separation with Whisper + Pyannote seems to be widely used, and WhisperX packages the two together.

github.com

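It uses Faster-Whisper, a faster variant of Whisper, rather than the original Whisper.

It would have been nice if it could use the Whisper API, but that does not appear to be supported at the moment. The Pyannote access token is still required as well.

Services that offer speaker diarization

I also considered switching from Whisper to a service that performs speaker diarization.

Candidates included GCP's Speech-to-Text, LINE's CLOVA Note, and a service called Noota.

Choosing an approach

Creating a HuggingFace account for Pyannote was not approved ("the approver had never heard of the website"), and the external services were each turned down for their own reasons, so spkrec-ecapa-voxceleb was chosen by elimination.

Among the services, I particularly wanted to switch to Noota, because it covers what I had built myself, such as summarization and Salesforce integration (I store the summaries in a custom object), and can also identify speakers. However, the total duration of the calls to analyze exceeded even the Business plan's maximum of 12,000 minutes per month, which would have required the Enterprise plan at 30,000 yen or more per month, so it was dropped.

CLOVA Note appears to be in open beta: it is free and unlimited through the smartphone app, but file uploads are limited to 300 minutes per month, or 600 minutes per month in total with the extra 300 granted for allowing data usage, so it was dropped.

I tried Speech-to-Text briefly, but audio files of 60 seconds or longer have to be stored in Google Cloud Storage, and there were simply many errors (some files occasionally failed to convert and kept failing on retry). The accuracy also felt lower than Whisper's.

Implementing speaker diarization with spkrec-ecapa-voxceleb on Google Colaboratory

The following article performs speaker diarization based on sample audio of each speaker.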

qiita.com

I looked for a rough example that works without sample audio and found one here.

huggingface.co

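I implemented it with these as references.

Preparing the audio file and the Whisper API result

As preparation, save the audio file and the JSON result of running it through the Whisper API's transcribe.

One caveat: specify response_format="verbose_json" as an option when transcribing. The audio file has to be split per utterance, and without verbose_json the result does not include the start and end time of each utterance.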
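As a minimal sketch of what the later steps rely on, the snippet below reads the segments of a verbose_json result and converts their times to milliseconds. The SAMPLE document is hypothetical and heavily trimmed (real responses carry more keys per segment), and segment_ranges_ms is a name made up for this example.

```python
import json

# Hypothetical, trimmed verbose_json result; only the keys used later are shown.
SAMPLE = """
{
  "text": "もしもしお世話になっております",
  "segments": [
    {"id": 0, "start": 0.0, "end": 3.2, "text": "もしもし"},
    {"id": 1, "start": 3.2, "end": 7.84, "text": "お世話になっております"}
  ]
}
"""


def segment_ranges_ms(whisper_result):
    """Return (id, start_ms, end_ms) for each segment."""
    return [
        # start and end are in seconds; round to whole milliseconds.
        (s["id"], round(s["start"] * 1000), round(s["end"] * 1000))
        for s in whisper_result["segments"]
    ]


ranges = segment_ranges_ms(json.loads(SAMPLE))
print(ranges)
```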

Here I uploaded them to the Colab notebook as sample.mp3 and sample.json. The audio file is about 25 minutes long at 128 kbps, just under 25 MB.
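The file size follows from the duration and bitrate. A quick back-of-the-envelope check, assuming a constant bitrate and ignoring container overhead (the function name is made up for this example):

```python
def audio_size_bytes(duration_seconds, bitrate_kbps):
    """Approximate size of a constant-bitrate audio stream, ignoring headers."""
    # kbps means 1000 bits per second; divide by 8 to convert bits to bytes.
    return duration_seconds * bitrate_kbps * 1000 // 8


size = audio_size_bytes(25 * 60, 128)
print(size)               # 24000000
print(size / 1_000_000)   # 24.0
```

About 24 MB for 25 minutes at 128 kbps is consistent with the "just under 25 MB" observed for the sample file.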

Installing the Python packages

Install speechbrain, pydub, and scikit-learn.

!pip install speechbrain pydub scikit-learn

Computing the embeddings

The speechbrain sample code does the following.

import torchaudio
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    # Use the GPU; the CPU is used by default
    run_opts={"device": "cuda"}
)

signal, fs = torchaudio.load("sample.mp3")

However, loading the audio file used here raised OutOfMemoryError: CUDA out of memory.

As shown below, the file had to be split based on Whisper's output and loaded piece by piece.

!mkdir audio-segments

import json
import torchaudio

from pydub import AudioSegment
from speechbrain.pretrained import EncoderClassifier

def audio_segmentation(audio, segments, format):
    """
    Split the audio into segments, save them, and return the list of saved file paths.
    """
    segment_file_names = []

    for segment in segments:
        # pydub slices by milliseconds, so multiply the seconds by 1000
        segment_start = segment["start"] * 1000
        segment_end = segment["end"] * 1000

        audio_segment = audio[segment_start:segment_end]
        segment_file_name = f"audio-segments/sample_{segment['id']:03}.mp3"
        audio_segment.export(segment_file_name, format=format)
        segment_file_names.append(segment_file_name)

    return segment_file_names


def segments_to_embeddings(classifier, segment_file_names, format):
    """
    Convert the audio files into embeddings.
    """
    embeddings = []

    for segment_file_name in segment_file_names:
        signal, _ = torchaudio.load(segment_file_name, format=format)
        embeddings.append(classifier.encode_batch(signal))

    return embeddings


def main():
    audio = AudioSegment.from_mp3("sample.mp3")
    format = "mp3"
    classifier = EncoderClassifier.from_hparams(
        source="speechbrain/spkrec-ecapa-voxceleb",
        run_opts={"device": "cuda"},
    )

    # Load the Whisper result
    with open("sample.json") as f:
        whisper_result = json.load(f)
        segments = whisper_result["segments"]

        segment_file_names = audio_segmentation(audio, segments, format)
        embeddings = segments_to_embeddings(classifier, segment_file_names, format)

main()

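On the first run, the model is downloaded.

From the second run on, a file of about 25 minutes takes about 30 seconds.

Most of that time goes to splitting and saving the files; computing the embeddings took only about 3.5 seconds. I looked into processing in memory without saving to files, but torchaudio.load appeared to accept only file paths.

Clustering

Next, run the clustering.

scikit-learn offers hierarchical (agglomerative) clustering through AgglomerativeClustering, along with FeatureAgglomeration, which applies the same idea to features for unsupervised dimensionality reduction. Here I use AgglomerativeClustering.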

import json
import torchaudio

from sklearn.cluster import AgglomerativeClustering
from speechbrain.pretrained import EncoderClassifier


def get_segment_file_names(segments):
    segment_file_names = []

    for segment in segments:
        segment_file_name = f"audio-segments/sample_{segment['id']:03}.mp3"
        segment_file_names.append(segment_file_name)

    return segment_file_names


def segments_to_embeddings(classifier, segment_file_names, format):
    """
    Convert the audio files into embeddings.
    """
    embeddings = []

    for segment_file_name in segment_file_names:
        signal, _ = torchaudio.load(segment_file_name, format=format)
        embedding = classifier.encode_batch(signal).detach().cpu().numpy()
        # embedding is a 3-D array of shape (1, 1, 192)
        embeddings.append(embedding.reshape(192,))

    return embeddings


def main():
    format = "mp3"
    classifier = EncoderClassifier.from_hparams(
        source="speechbrain/spkrec-ecapa-voxceleb",
        run_opts={"device": "cuda"},
    )

    # Load the Whisper result
    with open("sample.json") as f:
        whisper_result = json.load(f)
        segments = whisper_result["segments"]

        segment_file_names = get_segment_file_names(segments)
        embeddings = segments_to_embeddings(classifier, segment_file_names, format)

        # The number of clusters can be passed to the constructor; the default is 2
        clustering = AgglomerativeClustering().fit(embeddings)

        with open("sample-speaker.txt", mode="w", encoding="utf-8") as f:
            current_speaker = clustering.labels_[0]
            spoken_words = []

            for label, segment in zip(clustering.labels_, segments):
                if label != current_speaker:
                    f.write(f"[話者{current_speaker + 1}] {''.join(spoken_words)}\n")
                    spoken_words = []
                    current_speaker = label

                spoken_words.append(segment["text"])

            # Write out the final speaker's words, which the loop leaves buffered
            if spoken_words:
                f.write(f"[話者{current_speaker + 1}] {''.join(spoken_words)}\n")

main()

The clustering finishes in an instant.
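The step that merges consecutive segments from the same speaker can also be sketched with itertools.groupby, which avoids tracking the current speaker by hand and does not drop the last turn. The labels and texts below are hypothetical stand-ins for clustering.labels_ and the Whisper segment texts, and group_by_speaker is a name made up for this example.

```python
from itertools import groupby


def group_by_speaker(labels, texts):
    """Merge consecutive texts that share a label into one speaker turn."""
    turns = []
    pairs = zip(labels, texts)
    # groupby only groups adjacent items, which is exactly a "turn".
    for label, group in groupby(pairs, key=lambda pair: pair[0]):
        turns.append((label, "".join(text for _, text in group)))
    return turns


# Hypothetical cluster labels and segment texts.
labels = [0, 0, 1, 0, 0]
texts = ["a", "b", "c", "d", "e"]
turns = group_by_speaker(labels, texts)
print(turns)
```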

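Putting the code together

With later reuse in mind, I consolidated the code, turning the working directory name into a variable and adding creation and cleanup of the working directory, among other changes.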

import json
import os
import shutil
import torchaudio

from pydub import AudioSegment
from sklearn.cluster import AgglomerativeClustering
from speechbrain.pretrained import EncoderClassifier
from time import perf_counter


work_dir = "audio-segments"

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    run_opts={"device": "cuda"},
)


def audio_segmentation(audio, segments, format):
    """
    Split the audio into segments, save them, and return the list of saved file paths.
    """
    start_time = perf_counter()

    duration = audio.duration_seconds * 1000
    print(f"duration: {duration}")
    segment_file_names = []

    for segment in segments:
        segment_start = segment["start"] * 1000
        segment_end = segment["end"] * 1000

        audio_segment = audio[segment_start:segment_end]
        segment_file_name = f"{work_dir}/segment-{segment['id']:03}.{format}"
        audio_segment.export(segment_file_name, format=format)
        segment_file_names.append(segment_file_name)

    print(segment_file_names[-1])
    return segment_file_names


def segments_to_embeddings(segment_file_names, format):
    """
    Convert the audio files into embeddings.
    """
    start_time = perf_counter()

    embeddings = []

    for segment_file_name in segment_file_names:
        signal, _ = torchaudio.load(segment_file_name, format=format)
        embedding = classifier.encode_batch(signal).detach().cpu().numpy()
        embeddings.append(embedding.reshape(192,))

    return embeddings


def speaker_diarization(embeddings, segments):
    start_time = perf_counter()

    clustering = AgglomerativeClustering().fit(embeddings)

    speakers_file_path = "speakers.txt"

    with open("debug.txt", mode="w", encoding="utf-8") as f:
        current_speaker = clustering.labels_[0]
        statements_by_speaker = []  # consecutive statements by the current speaker
        speaker_dialyzed_conversations = []  # the conversation, one entry per speaker turn

        def join_statements():
            # Join with "。". Depending on the file, text may already end with
            # "。", "、" or "?", so tidy up each doubled punctuation mark.
            joined_statement = '。'.join(
                statements_by_speaker
            ).replace('。。', '。').replace('、。', '、').replace('?。', '? ')
            return f"話者{current_speaker + 1}:{joined_statement}"

        for label, segment in zip(clustering.labels_, segments):
            f.write(f"{label}: {segment['text']}\n")

            # The speaker changed: flush the previous speaker's statements
            if label != current_speaker:
                speaker_dialyzed_conversations.append(join_statements())
                statements_by_speaker = []
                current_speaker = label

            statements_by_speaker.append(segment["text"])

        if statements_by_speaker:
            speaker_dialyzed_conversations.append(join_statements())

    with open(speakers_file_path, mode="w", encoding="utf-8") as f:
        f.write("\n".join(speaker_dialyzed_conversations) + "\n")

    return speakers_file_path


def speaker_diarization_by_whisper_json(whisper_json_file_path, audio_file_path):
    with open(whisper_json_file_path) as f:
        whisper_json = json.load(f)
        segments = whisper_json["segments"]

        audio_format = audio_file_path.split(".")[-1]
        audio = AudioSegment.from_file(audio_file_path, audio_format)

        segment_file_names = audio_segmentation(audio, segments, audio_format)
        embeddings = segments_to_embeddings(segment_file_names, audio_format)
        speakers_file_path = speaker_diarization(embeddings, segments)
        print(f"speakers_file_path: {speakers_file_path}")


def main(whisper_json_file_path, audio_file_path):
    speaker_diarization_by_whisper_json(whisper_json_file_path, audio_file_path)


if os.path.isdir(work_dir):
    shutil.rmtree(work_dir)

os.mkdir(work_dir)
main("sample.json", "sample.mp3")

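Retrospective

I muddled through without really understanding it, but what architecture would suit actual production use?

It would run on AWS, but EC2 GPU instances are fairly expensive, and I have never set up an environment with CUDA and the like.

SageMaker might be cheaper if it could be used, but my understanding is that it is meant for models you train yourself.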

2023-05-07

Use at(-1) to get the last element of an array in JavaScript

JavaScript Node.js

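In Node.js code, a colleague was getting the extension from a file name with fileName.split('.').slice(-1)[0].

We were on Node.js 18, so in code review I pointed out that ".at(-1) is enough to get the last element of an array". The colleague did not know about at, hence this note.

About Array.at

On Node.js it is available from v16.6.0. It is a relatively new method, but v16.6.0 was released on 2021/7/29, so it has been usable for nearly two years. The major web browsers gained it around the same time, so it works without problems in current browsers.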

developer.mozilla.org

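For an argument from 0 through length - 1, it returns the same value as bracket access. For an argument of length or greater, it returns undefined, just like the bracket operator.

The results differ when the argument is negative. Bracket access with a negative index returns undefined for an ordinary array, whereas at returns the element at index argument + length.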
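The rule can be modelled in a few lines. This is a Python emulation written for illustration, not JavaScript's actual implementation: it handles integer indexes only, and None stands in for undefined.

```python
def at(items, index):
    """Mimic Array.prototype.at for a Python list; None stands in for undefined."""
    if index < 0:
        # A negative argument counts from the end: argument + length.
        index += len(items)
    if 0 <= index < len(items):
        return items[index]
    # Out of range in either direction.
    return None


print(at([1, 2, 3], 0))
print(at([1, 2, 3], -1))
print(at([1, 2, 3], 3))
print(at([], -1))
```

The same arithmetic explains why an argument below -length also yields undefined: argument + length is still negative.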

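Getting the last element of an array

Because of how at treats negative arguments, at(-1) returns the last element regardless of the array length, as long as the array is not empty.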

console.log([1, 2, 3].at(-1)) // 3
console.log([1, 2].at(-1)) // 2
console.log([1].at(-1)) // 1
console.log([].at(-1)) // undefined

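MDN goes out of its way to include an example that "returns the last value of an array". Its "Comparing methods" section compares this with the pattern using the bracket operator and length and the pattern using slice, and at is the easiest to read.

Retrospective

at feels like a method born to make getting the last element easy.

If specifying an out-of-range index threw an exception, as with Java's IndexOutOfBoundsException, that would be one thing, but in JavaScript passing a value beyond the array length to the bracket operator returns undefined without an error, so apart from accepting negative numbers there is no difference from the bracket operator.

Incidentally, the colleague who wrote slice(-1)[0] seems to have read a Qiita article found by searching for "配列の末尾 javascript" ("last element of array javascript").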

qiita.com

There too, if you read the comments to the end, at(-1) is suggested in a comment.