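What is a Dataset? | IBM
Original title: What is a Dataset? | IBM
Analysis results
- Category
- AI
- Importance
- 66
- Trend score
- 30
- Summary
- A dataset is a collection of data, typically organized as tables or arrays. Datasets are the basic unit used for analysis and processing, and they can take a variety of formats and structures.
- Keywords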
What is a dataset?

Authors: Annie Badman, Staff Writer, IBM Think; Matthew Kosinski, Staff Editor, IBM Think

A dataset is a collection of data typically organized in tables, arrays or specific formats, such as CSV or JSON, for easy retrieval and analysis. Datasets are essential for data analysis, machine learning (ML), artificial intelligence (AI) and other applications that require reliable, accessible data.

Organizations today collect large amounts of data from various sources, including customer interactions, financial transactions, IoT devices and social media platforms. To unlock the business value of all this data, it must often be organized into datasets: organized collections that make information accessible for analysis and application.

Different types of datasets store data in various ways. For instance, structured datasets often arrange data points in tables with defined rows and columns. Unstructured datasets can contain varied formats such as text files, images and audio. While not all datasets involve structured data, they always have some general structure to them, whether defined schemas or loosely organized syntax in semistructured data formats such as JSON or XML.

Examples of datasets include:

- Customer service datasets tracking support interactions and resolutions.
- Manufacturing datasets monitoring equipment performance metrics.
- Sales datasets analyzing transaction patterns and consumer behavior.
- Marketing datasets measuring campaign effectiveness and engagement.

Organizations often use and maintain multiple datasets to support various business initiatives, including data analysis and business intelligence (BI). Big data, in particular, relies on massive, complex datasets to deliver value. When properly collected, managed and analyzed using big data analytics, these datasets can help uncover new insights and enable data-driven decision-making.
In recent years, the rise of artificial intelligence (AI) and machine learning has further increased the focus on datasets. Organizations need extensive, well-organized training data to develop accurate machine learning models and refine predictive algorithms. According to Gartner, 61% of organizations report having to evolve or rethink their data and analytics operating model because of the impact of AI technologies.[1]

What a dataset is and is not

Though the term "dataset" is often used broadly, certain qualities determine whether a collection of data constitutes a dataset. Generally, datasets have 3 fundamental characteristics: variables, schemas and metadata.

- Variables represent the specific attributes or characteristics being studied within the dataset. For example, in a sales dataset, variables might include product ID, price and purchase date. Variables often serve as inputs for machine learning algorithms and statistical analysis.
- Schemas define a dataset’s structure, including the relationships and syntax between its variables. For example, a tabular dataset’s schema might outline the dataset’s formats and column headers, such as "date," "amount" and "category." A JSON schema might describe nested data structures such as customer profiles with attributes such as "name," "email" and an array of "order history" objects.
- Metadata, or data about data, provides essential context about the dataset, including details about its origin, purpose and usage guidelines. This information helps ensure that datasets remain interpretable and integrate effectively with other systems.

Not all collections of data qualify as datasets.
Random accumulations of unrelated data points typically don't constitute a dataset without some proper organization and structure to enable meaningful analysis.

Similarly, while application programming interfaces (APIs), databases and spreadsheets can interact with or contain datasets, they are not necessarily datasets themselves. APIs allow applications to communicate with each other, which sometimes involves accessing and exchanging datasets. Databases and spreadsheets are containers for information, which can include datasets.

Types of datasets

Organizations generally work with 3 main types of datasets, typically classified based on the type of data they handle:

- Structured datasets
- Unstructured datasets
- Semistructured datasets

Organizations often use multiple types of datasets in combination to support comprehensive data analytics strategies. For example, a retail business might analyze structured sales data alongside unstructured customer reviews and semistructured web analytics to get better insights into customer behavior and preferences.

Structured datasets

Structured datasets organize information in predefined formats, typically tables with clearly defined rows and columns. These datasets are foundational to many critical business processes, such as customer relationship management (CRM) and inventory management.

Because structured datasets follow consistent schemas, they enable fast querying and reliable analysis. This makes them ideal for business intelligence tools and reporting systems that require precise, quantifiable data.

Common examples of structured datasets include:

- Financial records organized in Excel spreadsheets with defined fields for dates, amounts and categories.
- Customer databases with standardized formats for contact information and purchase history.
- Inventory systems tracking product quantities, locations and movement.
- Sensor data streams providing uniform metrics for equipment monitoring and predictive maintenance.
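The three characteristics described earlier (variables, a schema and metadata) can be sketched with a small structured sales dataset. This is a minimal illustration rather than a standard implementation: the `Dataset` class, the `SALES_SCHEMA` mapping, the column names and the sample rows are all hypothetical, chosen to mirror the product ID, price and purchase date variables mentioned in the text.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical schema: maps each variable (column) to its expected Python type.
SALES_SCHEMA = {"product_id": str, "price": float, "purchase_date": date}


@dataclass
class Dataset:
    """A toy structured dataset: a schema, descriptive metadata and rows."""
    schema: dict
    metadata: dict
    rows: list = field(default_factory=list)

    def add_row(self, row: dict) -> None:
        """Append a row only if its columns and types match the schema."""
        if set(row) != set(self.schema):
            raise ValueError(
                f"expected columns {sorted(self.schema)}, got {sorted(row)}"
            )
        for name, expected_type in self.schema.items():
            if not isinstance(row[name], expected_type):
                raise TypeError(f"{name!r} must be {expected_type.__name__}")
        self.rows.append(row)


# Metadata records context about the data (values here are made up).
sales = Dataset(
    schema=SALES_SCHEMA,
    metadata={"origin": "example point-of-sale export", "purpose": "illustration only"},
)
sales.add_row({"product_id": "A-100", "price": 19.99, "purchase_date": date(2024, 1, 15)})
sales.add_row({"product_id": "B-200", "price": 5.50, "purchase_date": date(2024, 1, 16)})

# Because every row follows the same schema, analysis code can rely on it.
total_revenue = sum(row["price"] for row in sales.rows)
```

Rejecting rows that do not match the schema is what keeps later queries and aggregations reliable; a row with a missing column or a price stored as text would raise an error instead of silently entering the dataset.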
Unstructured datasets

Unstructured datasets contain information that doesn't conform to traditional data models or rigid schemas. While these datasets require more sophisticated processing tools, they often contain rich insights that structured data formats cannot capture.

Organizations rely on unstructured datasets to power artificial intelligence and machine learning models. These datasets provide the diverse, real-world data needed to train AI models and develop more advanced analytics capabilities.

Common examples of unstructured datasets include:

- Text documents, such as emails, reports and web pages.
- Images and videos used to train machine learning models.
- Audio recordings from real-world applications.
- Chat logs and customer service transcripts.

Semistructured datasets

Semistructured datasets bridge the gap between structured and unstructured data. While they don't follow rigid schemas, they incorporate defined syntax or markers to help organize information in flexible yet parseable formats.

This hybrid approach makes semistructured datasets valuable for modern data integration projects and applications that need to handle diverse data types while maintaining some organizational structure.

Common examples of semistructured datasets include:

- JSON, HTML and XML files used in web applications and APIs.
- Log files containing both formatted fields and free-form text.
- Public datasets combining multiple data formats for broader accessibility.

Sources of datasets

Organizations collect data from multiple sources to build datasets that support various business initiatives. Data sources can directly determine both the quality and utility of datasets.
Some common data sources include:

- Data repositories
- Databases
- Application programming interfaces (APIs)
- Public data platforms

Data repositories

Data repositories are centralized stores of data. Proprietary data repositories often house sensitive or business-critical data such as customer records, financial transactions or operational metrics that provide competitive advantages.

Other data repositories are publicly available. For example, a platform such as GitHub hosts open source datasets alongside code. Researchers and organizations can use these public datasets to collaborate openly on machine learning models and data science projects.

Databases

Databases are digital data repositories optimized for securely storing and easily retrieving data as needed. A database can contain a single dataset or multiple datasets. Users can quickly extract relevant data points by running database queries that use specialized languages such as structured query language (SQL).

Application programming interfaces (APIs)

APIs connect software applications so they can communicate. Data consumers can use APIs to capture data in real time from connected sources, such as web services and digital platforms, and funnel it to other apps and repositories for use.

Data scientists often build automated data collection pipelines by using languages such as Python, which offers robust libraries for API integration and data processing. For example, a retail analytics system might use these automated pipelines to continuously gather customer purchase data and inventory levels from e-commerce stores and inventory management systems.

Public data platforms

Sites such as Data.gov and city-level open data initiatives such as New York City Open Data provide free access to datasets that include healthcare, transportation and environmental metrics. Researchers can use these datasets to study everything from transportation patterns to public health trends.
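As a rough illustration of the point made under Databases, that a database can hold a dataset and that SQL queries extract the relevant data points, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `sales` table, its columns and the sample rows are assumptions made up for this example and do not come from any real system.

```python
import sqlite3

# An in-memory SQLite database that holds a single, hypothetical sales dataset.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (product_id TEXT, price REAL, purchase_date TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("A-100", 19.99, "2024-01-15"),
        ("B-200", 5.50, "2024-01-16"),
        ("A-100", 19.99, "2024-01-17"),
    ],
)

# Extract only the data points of interest: orders and revenue per product
# on or after a given date. The date is passed as a bound parameter.
rows = conn.execute(
    "SELECT product_id, COUNT(*) AS orders, ROUND(SUM(price), 2) AS revenue "
    "FROM sales WHERE purchase_date >= ? "
    "GROUP BY product_id ORDER BY product_id",
    ("2024-01-15",),
).fetchall()
conn.close()

for product_id, orders, revenue in rows:
    print(product_id, orders, revenue)
```

Storing dates as ISO 8601 text keeps the string comparison in the `WHERE` clause consistent with chronological order, which is a common convention in SQLite since it has no dedicated date type.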
Dataset use cases

From powering artificial intelligence to enabling data-driven insights, datasets are foundational to several key business and technological initiatives.

Some of the most common applications of datasets include:

- Artificial intelligence (AI) and machine learning (ML)
- Data analysis and insights
- Business intelligence (BI)

Artificial intelligence (AI) and machine learning (ML)

Artificial intelligence